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Abstract 

Abstract —  This  paper  is  concerned  with  the  mapping  of  algorithms  structured  as  depth  p 
nested  for  loops  into  special  purpose  systolic  VLSI  linear  arrays.  The  mappings  are  done  by  using 
linear  functions  transforming  the  original  sequential  algorithms  into  a  form  suitable  for  parallel 
execution  on  linear  arrays.  The  derivation  of  feasible  mapping  is  done  by  identifying  formal  criteria 
to  be  satisfied  by  both  the  original  sequential  algorithm  and  proposed  transformation  function. 
Those  formal  criteria  define  the  universe  of  feasible  solutions  and  thus  enable  us  to  derive  large 
families  of  transformations;  the  target  transformation  can  be  then  chosen  using  additional  criteria. 
Among  such  criteria  could  be:  minimal  execution  time,  smallest  number  of  processors  to  be  used,  or 
the  requirement  to  use  a  processor  with  specific  characteristics  provided  to  us.  We  also  study  issues 
dealing  with  modular  extensibility  (using  one  type  of  processor  for  arrays  of  various  length)  and 
partitioning  (using  arrays  that  are  small  to  solve  large  problems).  The  methodology,  which  deals 
with  general  algorithms,  is  illustrated  by  synthesizing  families  of  algorithms  for  matrix  multiplication 
and  a  version  of  the  Warshall-Floyd  transitive  closure  algorithm. 

Index-Terms —  VLSI,  linear  systolic  array,  algorithm  transformations,  hyperplane,  parallel 
processing,  data  dependence,  data  contention,  modularly  extensible,  partition  model,  matrix  mul- 
tiplication, path-finding  problems. 
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1      Introduction 

This  paper  is  concerned  with  designing  special  purpose  VLSI  chips  to  implement  particular  algo- 
rithms. Kung,  Hwang  and  Briggs.  Mead  and  Conway,  Ullman  and  other  researchers  pointed  out 
that  it  will  be  beneficial  to  design  systolic  algorithms  [16]  [24]  [10]  [36],  which  are  especially  suitable 
for  implementation  as  VLSI  chips.  For  systolic  algorithms,  consult  [16]  [9]  [33]  [18]  [19]  [21]  [42]  [6]. 
Implementation  should  be  as  efficient  as  possible  in  terms  of  the  space  and  time  resources  required. 
Area-Time  trade-offs  were  studied  by  Thompson,  Brent  and  Kung,  Kedem  [35]  [5]  [11]  and  others. 

A  systolic  array  is  a  special-purpose  parallel  device,  made  out  of  a  few  simple  processing-element 
(PE)  types  whose  interconnection  pattern  exhibits  regularity  and  locality.  Several  array  structures 
have  been  proposed,  including  linear  arrays,  mesh-connected  arrays,  and  hexagonal  arrays.  In  a 
typical  application,  such  arrays  would  be  attached  as  peripheral  devices  to  a  host  computer  which 
inserts  input  values  into  them  and  extracts  output  values  from  them  [16]. 

We  will  study  linear  systolic  arrays  in  this  paper,  as  linear  arrays  are  attractive  for  their  bounded 
I/O  requirements  and  a  simple  global  clock  whose  rate  is  independent  of  the  size  of  the  array.  We 
will  consider  the  important  class  of  algorithms  structured  as  (depth)  p  nested  for  loops.  VLSI 
implementation  of  such  algorithms  were  studied  before,  but  it  is  our  goal  to  present  a  systematic 
method  for  transforming  them  into  linear  arrays. 

The  class  of  a  general  p  nested  for  loop  algorithms  includes  algorithms  to  solve  matrix  multi- 
plication, L-U  decomposition,  matrix  inversion,  linear  systems,  path-finding  problem  [1]  (including 
all  shortest  path  problem  and  transitive  closure  problem),  pattern  matching,  bubble  sort,  a  version 
of  Discrete  Fourier  Transform  (DFT)  [24],  convolution  [16],  certain  problems  solvable  by  dynamic 
programming,  and  others.  These  problems  are  of  great  importance  in  scientific  computation. 

Kung  [14]  [15]  [16]  first  found  that  the  regularity  and  locality  properties  of  many  algorithms 
make  them  suitable  VLSI  systolic  array  implementations.  Among  them  were  matrix  multiplication, 
matrix-vector  multipHcation.  L-U  decomposition,  and  others.  Hwang  and  Cheng,  Rote,  Ma  et  al, 
and  Hochet  [9]  [34]  [23]  [8]  and  others  found  systolic  array  algorithms  for  solving  linear  systems, 
path-finding  problems,  matrix  inversion,  and  others.  These  pioneering  works,  which  increased  the 
understanding  of  regularity  and  locality  properties  of  algorithms,  were  based  on  the  characteristics 
of  an  individual  algorithm  under  consideration. 

Systematic  synthesis  methodologies  were  also  proposed.  Kung  [18],  Quinton  [30],  and  Lin  and 
Wah  [22]  synthesized  systolic  array  directly  from  uniform  recurrent  equations.  They  mapped  p 
nested  loop  algorithms  onto  (p-  1)-D  systolic  arrays.  However,  their  method  was  limited  in  scope 
because  they  could  not  find  an  easy  way  to  determine  the  entrance  time  of  input  variables  and  to 
obtain  families  of  systohc  implementations. 

Kuhn  [12],  Moldovan  [26]  [27]  [28]  [29].  Miranker  and  Winkler  [25],  Wong  and  Delosme  [41] 


1     INTRODUCTION  3 

synthesized  systolic  array  based  on  the  hyperplane  method  introduced  by  Lamport  [20].  They 
found  the  data-dependence  vectors  [20]  of  an  algorithm  first,  and  then  they  found  a  nonsingular 
linear  mapping  preserving  the  data-dependence  ordering  of  the  original  algorithm.  This  method  is 
also  called  space-time  mapping. 

In  their  model,  Kuhn  [12]  and  Moldovan  [26]  [27]  mapped  a  p  dimensional  problem  space  to  a  t 
loops  1-D  time  hyperplane  and  an  s-D  space  hyperplane  mapping,  where  p  =  t+  s.  Moldovan  [28] 
[29],  and  Miranker  and  Winkler  [25]  mapped  a  p  dimensional  problem  space  to  a  1-D  time  hyper- 
plane and  (p  -  1)-D  space  hyperplane  mapping.  Wong  and  Delosme  [41]  mapped  a  p  dimensional 
problem  space  to  a  p-  2  loops  1-D  time  hyperplane  and  2-D  space  hyperplane  mapping,  they  then 
mapped  this  p  -  2  loops  1-D  time  hyperplane  to  a  1-D  clock  ticks. 

As  shown  by  examples  in  those  papers,  the  results  were  in  practice  useful  for  mapping  p  nested 
loop  algorithms,  for  p  >  2,  on  higher  than  1-D  arrays.  In  fact,  there  is  no  example  given  of  mapping 
a  p  nested  loop  algorithm  into  a  linear  array  when  p  >  2. 

Another  open  area  left  by  the  previous  work  was  the  formulation  of  necessary  and  sufficient 
conditions  for  correct  mappings.  As  an  example  we  can  observe  that  even  though  using  the  previous 
results  it  was  possible  to  guarantee  that  indices  are  mapped  properly  on  PEs,  there  was  no  formal 
way  of  guaranteeing  that  tokens  do  not  clash  in  the  PEs. 

Ramakrishnan  et  al.  [33]  mapped  2-D  and  3-D  homogeneous  graphs  on  linear  arrays.  They 
provided  certain  necessary  conditions  for  a  correct  mapping,  but  they  did  not  provide  a  complete 
set  of  sufficient  conditions.  As  a  result,  they  could  not  synthesize  the  linear  array  implementations 
of  Ramakrishnan  and  Varman  [32]. 

We  now  proceed  to  examine  the  1/0  complexity  of  common  systolic  array  implementations  of 
the  important  algorithms  we  mentioned  above  (matrix  multiplication,  etc.).  The  algorithms  were 
usually  mapped  into  (p  -  1)-D  arrays,  see  works  of  H.  T.  Kung,  Leiserson,  Mead,  Conway,  Kuhn, 
Moldovan,  Fortes.  Rote,  Hochet.  and  S.  Y.  Kung:  [14]  [15]  [16]  [24]  [12]  [26]  [27]  [28]  [29]  [34]  [8]  [19]. 
The  arrays  were  of  size  0(n''~M  and  the  execution  time  was  generally  0(n),  resulting  in  optimal 
processor/time  product  of  0(7?^).  However,  the  number  of  pins  required  was  at  least  Q{n^~^),  that 
is,  it  increased  with  the  size  of  the  problem  solved,  if  p  >  2.  This  may  cause  potential  difficulties 
during  integration  of  the  systolic  array  with  the  host  computer. 

Our  goal  is  to  map  p  nested  for  loop  algorithms  on  systolic  arrays  with  constant  number  of  pins 
(bounded  I/O  requirements),  independent  of  the  problem  (or  array)  size.  We  will  thus  synthesize 
linear  array  systoUc  algorithms  with  the  following  complexity  properties.  For  the  set  of  problems 
listed  above,  the  execution  time  will  be  0(7!^"^ )  and  the  number  of  I/O  pins  will  be  constant.  We 
will  need  0{n^~^ )  storage  locations.  There  are  various  way  of  deciding  on  the  appropriate  number 
of  PE's.  It  is  possible  to  use  only  0(n)  PE's,  thus  obtaining  an  optimal  processor/time  product 
of  0(7?'').   This,  however,  requires  storage  of  0(71^"'^)  in  each  PE.  Thus,  if  we  want  the  array  to 


1     INTRODUCTION  4 

be  modularly  extensible,  that  is  for  each  PE  to  have  constant  amount  of  storage,  we  may  actually 
prefer  in  practice  to  use  0{n'''~^)  PE's. 

To  accomplish  the  above  in  a  systematic  manner,  we  present  a  methodology  for  mapping  p 
nested  for  loop  algorithms  into  linear  arrays.  A  mapping  is  derived  by  using  a  function  transforming 
the  original  sequential  algorithm  into  a  form  suitable  for  parallel  execution  on  a  linear  array.  Our 
approach,  similarly  to  Kuhn's  and  Moldovan's,  is  based  on  Lamport's  hyperplane  method  [20]. 
However,  we  transform  a  p  nested  for  loop  algorithm  into  a  1-D  time  hyperplane  and  a  1-D  space 
hyperplane  linear-array  algorithm. 

We  also  find  the  data-dependence  vectors  first.  However,  we  classify  them  into  three  types 
based  on  certain  formal  properties.  This  classification  of  data-dependence  vectors  allows  us  to 
formulate  conditions  on  the  target  linear  array  down  to  the  register  level.  As  data-dependence 
constraints  provide  more  information  than  the  standard  directed  graph  representations  [18]  [19] 
[33],  our  method  can  provide  more  implementations  details. 

A  mapping  of  a  p  nested  for  loop  algorithm  on  a  linear  array  must  satisfy  certain  constraints  in 
order  to  assure  correct  flows  of  token  streams.  In  this  paper  we  list  formal  necessary  and  sufficient 
conditions  to  be  satisfied  by  the  mapping  assigning  tokens  to  the  PE's  at  various  time  instances,  so 
that  the  resulting  computation  is  physically  (geometrically)  feasible.  Those  necessary  and  sufficient 
conditions  are  not  reducible  to  the  results  of  Kuhn,  Moldovan,  Miranker  and  Winkler,  and  Wong 
and  Delosme. 

This  follows  from  the  fact  that  our  necessary  and  sufficient  conditions  encompass  several  dif- 
ferent aspects  of  the  design.  In  addition  to  preservation  of  data-dependence  and  nonsingularity  of 
the  mapping  (as  done  by  previous  researchers)  they  also  include  prevention  of  data  contention  and 
complexity  of  PE's  hardware. 

As  an  added  benefit,  this  set  of  necessary  and  sufficient  conditions  allows  us  to  study  large 
classes  of  mappings,  as  we  are  able  to  find  families  of  solutions  satisfying  the  conditions  we  derive. 
Among  the  feasible  solutions  we  can  choose  some  based  on  optimality  criteria,  such  as  minimum 
execution  time  or  smallest  number  of  PE's  used. 

We  also  analyze  time  complexity  and  the  storage  complexity  of  linear  array  implementations. 
For  a  class  of  algorithms  we  study  tight  bounds  of  time  complexity  of  the  linear  array  implementa- 
tions. Such  algorithms  include  matrix  multiplication,  L-U  decomposition,  inversion  of  nonsingular 
triangular  matrix,  matrix  orthogonal  triangularization.  a  version  of  transitive  closure  algorithm 
[7],  DFT,  bubble  sort,  and  others.  We  also  provide  a  technique  to  determine  whether  linear  array 
implementations  of  these  algorithms  have  both  the  optimal  time  complexity  and  storage  complexity. 

There  may  be  additional  concerns  that  one  may  want  to  consider  in  choosing  a  specific  linear 
array.  For  instance,  we  may  consider  relations  between  the  number  of  PE's  and  the  number  of 
registers  in  a  PE.  If  the  number  of  registers  in  a  PE  is  constant,  then  arrays  of  various  sizes  can 
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be  build  with  a  single  type  of  PE,  resulting  in  an  modularly  extensible  array.  We  may  cdso  be 
provided  with  some  "standard"  PE  type  and  be  required  to  design  a  linear  array  using  the  specific 
number  of  storage  cells  in  the  PE's.  Our  method  allows  us  to  understand  the  relations  between 
such  features  of  the  provided  PE's  and  the  linear  arrays  that  can  be  designed. 

Finally,  we  are  interested  in  designing  algorithms  that  can  be  partitioned.  Generally,  the  number 
of  PE's  needed  for  the  linear  array  solving  some  problem  grows  with  the  size  of  the  problem. 
We  may,  however,  be  restricted  to  arrays  of  certain  size  while  still  being  required  to  solve  large 
problems.  To  accomplish  that,  it  is  necessary  to  partition  the  algorithm  so  that  the  token  streams 
travel  through  the  array  more  than  once.  Partitioning  of  algorithms  was  considered  before  by 
Guibas  et  al.  [7],  Hwang  and  Cheng  [9],  Moldovan  and  Fortes  [28],  Annaratone  et  al.  [2],  and 
others.  Our  method  for  partitioning  is  quite  general,  but  we  need  to  impose  additional  restrictions 
on  the  original  "unpartitioned"  families  of  solutions  before  we  can  partition  them.  Note  that  as 
the  unpartitioned  algorithm  used  only  a  (small)  constant  numbers  of  pins,  I/O  considerations  by 
themselves  do  not  require  partitioning  as  done  by  e.g.  [7]  [9]  [28]. 

As  implicitly  discussed  above,  it  is  useful  to  design  linear  array  implementations  for  2  nested 
loop  algorithms  [24]  [16]  [28]  [29].  Linear-array  algorithms  to  implement  3  nested  for  loop  algo- 
rithms also  have  been  studied  before.  Some  previous  results  on  linear  arrays  for  3  nested  for  loop 
algorithms  are  due  to  [13]  [17]  [31]  [32]  [33]  [40].  The  fundamental  problem  studied  there  was 
matrix  multiplication.  Their  results  fall  within  the  framework  of  the  methodology  described  in 
this  paper.  In  addition,  we  can  also  derive  families  of  implementations  that  were  not  obtained 
before,  by  examining  families  of  solutions  satisfying  the  constraints  we  define.  However,  some 
implementations  using  different  type  of  architecture,  such  as  busses  [38]  [39]  do  not  fall  within  our 
framework. 

The  rest  of  this  paper  is  organized  as  follows.  In  Section  2  we  introduce  the  general  p  nested  loop 
algorithm  and  the  linear-array  model  we  will  use.  In  Section  3  we  demonstrate  our  methodology  by 
synthesizing  a  matrix-multiplication  algorithm.  In  Section  4.  we  first  introduce  the  classification 
of  data-dependence  vectors  which  allow  us  to  formulate  conditions  on  the  target  linear  array. 
Second,  we  provide  the  necessary  and  sufficient  conditions  so  that  the  mapping  results  in  a  correct 
systolic  linear  array  computation.  Third,  we  analyze  time  and  storage  complexity  of  linear  array 
implementations.  For  an  important  class  of  algorithms,  we  find  tight  bounds  on  the  time  complexity 
of  linear  array  implementations.  We  also  provide  a  technique  to  show  whether  the  linear  array 
implementations  of  these  algorithms  have  both  the  optimal  time  complexity  and  optimal  storage 
complexity.  In  Section  5,  we  discuss  various  optimization  criteria  for  general  families  of  algorithms. 
In  Section  6  we  discuss  the  constraints  required  for  partitioning  of  algorithms.  In  Section  7  we 
compare  the  suitability  of  two  path-finding  algorithms  for  implementation  as  linear  arrays  using  our 
methodology.  The  original  Warshall-Floyd  path-finding  algorithm  is  not  suitable  for  transformation 
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to  a  linear  array;  however,  another  reindexed  path-finding  algorithm  due  to  Kung  [19]  can  be 
transformed  into  linear  array.  Finally,  some  concluding  remarks  are  given  in  Section  8.  In  the 
Appendix,  the  methodology  of  this  paper  is  distilled  into  a  procedure  for  mapping  p  nested  for 
loop  algorithms  into  linear  arrays. 

2      Model  for  Algorithms  and  Linear  Arrays 

2.1      Algorithm  Model 

In  this  paper,  we  consider  the  class  of  algorithms  with  p  nested  loops  of  the  form: 


for 

»i  = 

li  to  uj  by  ki 

for 

j2  =  h  to  U2  by  ^'2 

for  ip  =  Ip  to  Up  by 

kp 

Statementi  ; 

Statement2  ; 

Statementm 

end  .for 

end 

_for 

end 

l_for 

I J  and  Uj  are  integer- valued  linear  expressions  possibly  involving  j'l,  12,  ...,  ij-\  for  I  <  j  <  p. 
Without  loss  of  generality,  we  assume  that  Ij  <  Uj  and  kj  —  1  for  all  \  <  j  <  p.  When  discussing 
complexity  results,  it  will  be  more  convenient  to  use  one  parameter  only  to  describe  the  size  of  the 
problem.  Thus  we  may  write  u,  —  /,  =  0[n)  for  appropriate  n. 

We  make  additional  assumptions  about  the  structure  of  the  statements  inside  the  loop,  as 
done  by  Lamport  [20],  who  first  studied  parallel  executions  of  algorithms  written  as  nested  loops. 
The  statements  in  the  loop  can  be  assignment  statements,  if  statements,  case  statements,  or  even 
repetitive  statements  (including  repeat  statements,  while  statements,  and  for-loop  statements). 
Since  the  goal  of  these  statements  is  to  compute  (modify)  some  data,  each  statement  can  be 
taken  as  a  special  type  of  assignment  statement.  These  statements  contain  no  I/O  statements, 
no  subroutines  or  functions  which  can  modify  data,  and  no  transfer  of  control  to  any  statement 
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outside  the  loop.   In  addition,  innerloop  data  dependencies  have  been  removed  by  using  compiler 
techniques  as  described,  for  instance,  in  [4]. 

Observe  that,  if  we  define  the  index  set  of  the  algorithm  P  as 

F  =  {(iui2,---,Jp)*\lj  <  ^J  <  Wj  for  j  =  l,2,...,p}, 

then  the  elements  of  /''  are  ordered  in  a  lexicographical  ordering  when  the  loops  are  executed. 
We  now  formally  define  our  algorithm  model,  described  intuitively  earlier. 

Definition:  A  p  nested  loop  algorithm  is  a  3-tuple  Ag  -  (P,VAg,FAg)-  Specifically: 

1.  P  -  {(n, . . .  ,?p)'|/j  <  ij  <  Uj  for  J  -  1,. ..  ,p}  is  the  loop  index  set. 

2.  Vaq  is  the  set  of  variables. 

3.  FAg  is  the  sequence  of  statements  computed  in  every  index  in  its  active  phase. 

The  loop  index  set  and  statements  satisfy  the  assumptions  discussed  above.  In  addition,  when 
dealing  with  a  p  nested  loop  algorithm,  we  will  frequently  write  "variable"  for  the  name  of  an 
array,  or  an  entry  of  that  array;  however,  when  dealing  with  a  corresponding  linear  array,  we  will 
write  "variable"  for  the  name  of  an  array,  but  we  will  write  "token"  for  the  entry  of  that  array,  if 
this  does  not  result  in  confusion.  For  example,  we  may  write  variable  A,  variable  A[17],  and  token 

Aim- 

2.2      Linear  Array  Model 

In  this  subsection,  we  present  formally  the  linear-array  model  we  will  use.  It  is  a  modification  of 
the  one  employed  by  Ramakrishnan  et  al.  [33].  It  may  be  helpful  for  the  reader  to  briefly  look  at 
Fig.  1  and  Fig.  2.  As  seen  there,  identical  PEs  are  connected  to  each  other  by  means  of  links  to 
form  a  linear  array. 

Definition:  A  linear  array  is  a  5-tuple  Ar  =  (M,  K.Tat,  Bat^Fat),  where  M  is  the  number  of 
PEs  in  the  array  and  K.Tat,  Bat  and  Fat  form  the  description  of  an  individual  PE.  The  number 
A'  states  the  number  of  data  links  for  each  PE.  Specifically: 

1.  .A/  is  the  number  of  PEs  in  the  array.  They  will  be  numbered  from  left  to  right  as 
PE,,PE2,...,PEm- 

2.  A'  is  the  number  of  data  links.  Every  PE  has  A'  pairs  of  input/output  ports  and  such  pairs 
will  be  referred  by  integers  from  {1,2,...,  A'}. 
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3.  Tat  —  (^1. '2i  •  •  •  i*A')i  where  /,  =  1  or  — 1  for  all  1  <  ?  <  A',  is  the  sequence  of  the  directions 
of  the  (data  streams  flowing  through  the)  links.  If  the  data  stream  i  flows  from  left  to  right 
then  ti  =  1,  if  it  flows  from  right  to  left  then  <,  =  —1. 

4.  Bjir  =  {bi,b2. . . .  ,bj^),  where  6,  >  0  for  all  1  <  ?  <  A',  is  the  sequence  of  the  sizes  of  the 
buffers  (the  number  of  shifting  registers)  corresponding  to  the  hnks. 

5.  Fat  is  the  A'-ary  functions  computed  by  every  PE  in  its  active  phase. 

Thus,  a  linear  array  has  the  following  communication  features: 

•  Each  link  i  is  used  for  communication  between  adjacent  PEs.  Consider  link  i  between  PEj 
and  PEj+\  for  I  <  j  <  M,  if  ^  =  1  then  the  output  link  i  of  PE-j  is  connected  to  the  input 
link  i  of  PEj+i;  if  i,  =  -1  then  the  output  link  i  of  PEj+i  is  connected  to  the  input  link  i 
of  PEj. 

•  External  communication  with  the  host  takes  place  through  certain  ports  designated  as  follows: 
for  each  i  if  /,  =  1  then  the  input  stream  i  enters  /,  of  PEi  and  the  output  is  from  O,  of 
PEm;  if  t,  =  —  1  then  the  input  stream  i  enters  /,  of  PE^  and  the  output  stream  is  from  Oi 
of  PEi-  (/,'s  and  0,'s  are  just  different  terms  for  links  at  the  boundaries  of  the  array.) 

•  At  time  t,  if  a  data  token  of  stream  i  reaches  PEj,  then  at  time  t  +  b,  +  I  this  token  will 
reach  PEj+t,-,  for  1  <  j  +  't  <  M- 

Each  data  stream  consists  of  data  tokens  that  travel  at  a  fixed  velocity  (number  of  PEs  per  clock 
tick)  on  a  unique  data  link.  To  implement  fixed  stream  velocity,  each  data  link  passes  through  all 
the  PEs  and  utilizes  a  constant  number  (possibly  zero)  of  shifting  registers  in  each  PE  as  a  delay 
buffer.  The  buffer  lengths  are  the  same  in  all  data  links  for  a  specific  data  stream,  so  that  for  a 
data  stream  the  link  delay  is  the  same  for  all  tokens  in  all  PE;  however,  different  data  links  may 
have  different  buffer  lengths.  A  PE  is  illustrated  in  Fig.  1  and  a  linear  array  is  illustrated  in  Fig. 
•2. 

In  Fig.  1,  there  are  A'  =  /  +  r  data  streams  1,2,. . . ,/  and  /  +  l,/  +  2,  ...,/  +  r.  The  first  /  data 
streams  flow  from  left  to  right  and  the  last  r  data  streams  flow  from  right  to  left.  In  Fig.  2,  I,/0, 
for  i  —  1,  ...,  I  and  Ii^j/Oi^j  for  j  =  1, . . . ,  r  are  external  input/output  ports  for  the  /  +  r  data 
streams  1,2, ...  /  and  /  +  1,  /  +  2, . . .  /  +  r  respectiy.  Data  tokens  are  fed  into  and  extracted  from 
the  array  through  these  ports. 

In  our  model  we  did  not  allow  <,  =  0,  though  this  is  of  course  a  simple  extension.  We  are 
interested  here  only  in  implementations  in  which  every  data  item  is  pipelined  in  order  to  avoid  the 


3    EXAMPLE  :  MATRIX  MULTIPLICATION  9 

need  for  memory  addressing  and  control  hardware  in  each  PE  and  the  preloading  and  downloading 
of  data.  By  aUowing  t,  =  0  we  can  model  other  implementations,  such  as  WARP  [17],  in  which 
certain  data  items  are  fixed  in  PEs. 

We  will  relate  the  F^r  to  the  F/^g  of  the  algorithm  model.  In  effect,  F^^  =  F^g- 
For  each  problem  "type"  such  as  matrix  multiplication  we  are  interested  in  a  family  of  linear 
arrays,  such  that  each  of  this  family  of  linear  arrays  can  solve  the  problem.  Ideally  we  would  like 
to  be  also  to  use  a  single  linear  array  for  all  values  of  the  problem  size,  so  that  M,  K ,  Tat,  Bat 
and  Fat  could  be  made  independent  of  the  problem  size.  However,  sometimes  we  will  allow  M  and 
6,'s  to  be  functions  of  the  problem  size  while  still  insisting  that  K,Tat  and  Fat  a.re  independent 
of  the  problem  size. 

In  the  next  section,  we  shall  show  how  to  map  a  3  nested  loop  algorithm  onto  a  linear  array. 

3      Example  :  Matrix  Multiplication 

We  find  it  expedient  to  describe  the  method  both  formally  and  by  referring  to  a  running  example 
of  matrix  multiplication.  In  the  Appendix  we  will  state  the  method  concisely. 
Let  us  consider  the  standard  4x4  matrix-multiplication  algorithm. 

Input  :  Matrices  A^-^^  and  i?4x4- 

Output:  Matrix  C4X4,  where  C4X4  —  ^4x4  X  .64x4- 

for  ?  =  0  to  3  ^ 

for  j  =  0  to  3 

for  ^-  =  0  to  3  ,         . 

C\i.j]:=C[uj]  +  A[,,k]*B[k,j] 
endJbr 
end  _for 
end_for 


Here  the  algorithm  model  Ag  =  ({(?', i,/c)'l0  <  ij,k,<  3},  {A,B,C},  (C[i,j]  :=  C[iJ]  +  A[i,k]  * 
B[k,j])).  Our  method  consists  of  several  steps. 

3.1      The  Lamport  condition 

We  reiterate  here  the  fundamental  result  due  to  Lamport  [20]  dealing  with  parallel  execution  of 
nested  loops.  It  was  also  used  by  Banerjee  et  al.  [4],  Kuhn  [12],  and  Moldovan  [28].  First,  we  label 
each  variable  in  the  loop  with  an  index  (t'l,  12,  •  •  •  •,«?)'•  We  will  frequently  write  "variable"  for  an 
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entry  of  a  matrix,  or  the  name  of  the  matrix,  if  this  does  not  result  in  confusion.  If  the  variable 
is  on  the  left-hand-side  of  :  =  ,  it  means  that  the  variable  is  regenerated  in  index  (t'l,  22, . . . ,  ip)*;  if 
the  variable  is  on  the  right-hand-side  of  :  =  ,  it  means  that  the  variable  was  previously  generated  in 
index  (z'l,  J2,  •  •  •  ,ip)'. 

For  matrix  multiplication,  p  =  3.  Now,  the  matrix-multiplication  algorithm  is  described  as: 


Input  :  Matrices  A4X4  and  ^4x4- 

Output:  Matrix  C4X4-  where  C4X4  =  A4X4  X  ^4x4. 

1.  for  i  =  0  to  3 

2.  for  j  =  0  to  3 

3.  for  A-  =  0  to  3 

4.  /!('-'■•'='[?•.  A-]:=  ^(■'^-i-*^)[t,fc]; 

5.  ^('■^■'''[fc.i]:=fi('-i-'''=)[fc,i]; 

6.  C("'J''^)[i,j]:=C(''J'''->'[i,i]; 

7.  C^'-^-^^[iJ]  :=  C^''^-%J]  +  A^''}'%,k]  *  B^'-^'''^[k,j] 

8.  end_for 

9.  end_for 

10.  end_for 


In  line  4,  /!''•■'•''"' [i.  A:]  means  that  /l[i.A:]  is  regenerated  in  step  (z,j,  A-)'  and  A^''^~^'''^[i,k]  means 
that  j4[2,fc]  was  previously  generated  in  step  {i,j  —  1,A;)*  and  is  used  in  step  (i,j.  A)'.  Similarly, 
for  B  and  C  in  lines  5  and  6.  After  having  all  the  needed  data,  in  line  7  it  can  execute  C[i,j]  = 
C[iJ]+  A[i,k]*B{k,j]. 

After  labeling,  one  can  define  data-dependence  vectors.  A  data-dependence  vector  of  a  variable 
can  be  viewed  as  difference  of  indices  where  a  variable  is  used  and  where  that  variable  was  generated. 
From  lines  4,  5  and  6.  it  is  clear  that  the  index  step  {i,j,kY  depends  upon  all  (j'-l,  j,  A')',  {i,j-l,ky, 
and  (i,j.k  —  1)'  index  steps.  Thus,  there  are  three  data-dependence  vectors  in  the  algorithm: 

di  =  (0,1,0)'  for  the  pair  (^(•■•'•'^^[i,  A-], /l(''^-i''^)[i,  fc]) 
d2  =  (1,0,0)'  for  the  pair  (5<''J-'^'[A%  j],  ^(-^•■'•'^^[A:,  j]) 
^3  =  (0,0,1)'  for  the  pair  (C'''^-*)}^  j],C(''^''^-i)[i,  j]) 

We  say  that  di  is  with  A,  ^2  is  with  B,  and  ^3  is  with  C.  In  addition,  we  say  that  all  of  di,d2  and 
c?3  are  related  to  C,  because  C[i,j]  uses  all  /1[j.A-],  B[k,j],  and  C[i,j],  the  variables  with  di,  d^, 
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and  c/3,  respectively.  Therefore,  in  matrix  multiplication  we  say  that  index  I2  depends  on  index  /i 
if  and  only  if  h  —  h  +  "'Wi  +  ni2'^2  +  "'3^3  for  mj,  m2,  m^  >  0  and  at  least  one  of  mi,  m2,  m^ 
is  positive.  These  relationships  can  be  described  by  a  data-dependence  graph  as  shown  in  Fig.  3. 
Lamport  considered  partitioning  of  indices  («i,  12, .  ■  ■  ,ipY  so  that  they  lie  on  a  family  of  parallel 
hyperplanes  such  that  all  indices  lying  on  one  hyperplane  can  be  executed  simultaneously.  Let  H 
be  a  vector  (oi,  02, . . .  ,ap),  then  aiX\  +  02X2  +  •  •  •  +  apXp  =  d  for  various  values  of  d  define  a  family 
of  hyperplanes.  We  will  frequently  write  "hyperplane"  for  a  family  of  parallel  hyperplanes,  the 
vector  defining  them,  or  an  individual  hyperplane  in  the  family,  if  this  does  not  result  in  confusion. 
A  special  case  of  Lamport's  result  is: 

Theorein  1  :  Let  H  =  {01.02-,  ■■■  ,ap)  be  a  hyperplane.  IfHd,  >  0  for  each  data-dependence 
vector  d,  =  (dji.d,2. .  ■  ■  .d,p)  and  two  indices  I\  and  I2  satisfy  H/j  =  H/2  =  c.  then  /j  and  I2  are 
independent  of  each  other,  and  therefore  they  can  be  executed  simultaneously.        O 

An  optimal  Lamport's  hyperplane  H  =  (qj,  Q2, . . .  ,Qp)  can  be  obtained  by  solving  an  integer 
programming  problem  under  the  following  constraints: 

1.  H^d,  >  0  for  all  data-dependence  vectors  and 

2.  H^  =  minH{max{|H(/2  -  /,)|  |  lu  h  G  P}}- 

H  minimizes  the  number  of  hyperplanes  required  to  partition  the  indices  and  gives  an  optimal 
time  for  parallel  implementation  of  the  algorithm  on  an  MIMD  (Multiple  Instructions  and  Multiple 
Data)  machine.  For  matrix  multiplication  H  =  (1,1,1),  that  is,  the  hyperplane  is  x  -\-  y  -\-  z  =  c. 
As  we  shall  immediately  see,  H  defined  above  cannot  be  used  for  implementing  linear  arrays,  as 
there  are  locality  constraints  to  be  satisfied. 

For  convenience  we  will  use  H  to  denote  Lamport's  hyperplanes,  HP~^  to  denote  hyperplanes 
used  for  mappings  on  {p  -  1)-D  arrays,  and  H-^  to  denote  hyperplanes  used  for  mappings  on  1-D 
(linear)  arrays.  Subscripts  may  be  used  to  distinguish  between  different  hyperplanes  in  the  same 
"class." 

3.2      Computation  of  H^  and  S 

It  is  our  goal  to  assign  each  index  of  P  to  both  a  specific  time  instance  and  a  specific  linear  array 
location  by  means  of  a  linear  transformation.  We  can  therefore  describe  the  desired  assignment 
as  a  linear  mapping  from  p  dimensions  into  2  dimensions.  Thus  (ri,i2,  • .  ■  ,ipY  1 — >■  {t,l)  where  t 
specifies  the  time  instance  and  /  the  PE  number.  We  refer  to  this  mapping  as  a  1-D  time  hyperplane 
and  a  1-D  space  hyperplane  linear-array  algorithm  (H-'iS).  This  terms  will  be  explained  in  the 
next  paragraph.  As  finding  the  mapping  is  the  heart  of  the  method,  the  description  of  this  step  will 
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be  quite  long  and  detailed.  We  will  also  state  precisely  in  Section  4  the  conditions  to  be  satisfied 
for  correct  and  efficient  implementation. 

First,  we  define  H-^  and  S.    H     is  a  vector  (h\,h2,  ■ .  ■  ,hp)  and  S  is  a  vector  (si,S2, .. .  ,Sp). 

j  (ii,i2i  •  •  •  lip)'  results  in  a  2  dimensional 


Given  an  index  («i,  J2i  •  •  •  i^p)*,  the  mapping 


vector: 


/?1       /l2 
Si       52 


(?l,i2,  ..  .  ,ip)     = 


hxix  +  /l2?'2  + h  /ipi 


Sill  +  S2»2  + V  Spl 


V'V  _ 


V-P 


I 


where  /  and  /  specify  the  time  and  the  location  of  a  data  token  with  index  (t'l ,  Z2, . . . ,  ipY- 

For  completeness,  we  relate  the  above  to  work  done  on  mapping  nested  loops  to  (p-l)-D  arrays. 

As  mentioned  previously  this  was  studied  by  Kuhn  [12]  and  Moldox-an  [28].  They  mapped  the  p-D 

space  of  indices  into  a  p-D  space:  (z'l,  12,  ■  ■  ■  ,ip)'  ' — >  (<,/i, . . .  ,/p_i),  where  (/i, . . .  ,^p-i)  state  the 

location  of  the  PE  executing  index  (Ji,i2i  •  •  ■  -.ipY  at  time  /. 

Their  algorithms,  which  we  denote  by  (HP"-*^,  S^"   ),  can  therefore  be  described  by  means  of  a 

mapping  described  by  p  nested  loops  on  a  p  dimensional  space.  H^"     is  a  vector  (/ij"  , . . .  ,h^~^) 

5ii  ...  Sip         \ 


and  SP   ^  is  a  matrix 


Given  an  index  (?'i,  12, . . .  ,ip)',  the  mapping 


HP-l 

gp-i 


'(p-i)i 


'•(p-i)p 


/ 


il,«2. 


,ip)    is  a  p  dimensional  vector: 


^11 


where  /  and  (/i. 


■Sip 


•^(p-i)p  / 


(n.'2 ipY  = 


(     h\-U,  ^  ■  ■  ■  +  hl'Hp 

6\\i\  + h  Sip«p 

:     +  ■••  +      : 

V    5(p_i)ill  +  ■••  +  5(p_i)p2p    / 


\ 

(   '   \ 

= 

h 

/ 

\  'p-1  / 

,  /p_i)  specify  the  time  and  the  (p-l)-D  location  of  a  data  token  with  index 
(11,12,...  ,ip)'.  When  given  a  H^,  they  found  a  space  mapping  S^"-^  by  requiring  that  j    ^^p-l    1 

be  nonsingular.  When  given  restricted  interconnection  primitives  and  SP"-*^  first,  they  sometimes 

(  HP~^  \ 
also  did  not  use  H^,  but  had  to  use  another  hyperplane  HP~-^   that  would  make   I     <-,p_i 

nonsingular. 

As  claimed  above,  our  problem  will  be  more  difficult,  as  in  general  we  will  not  be  able  to  find 
any  S  corresponding  to  a  1-D  array  with  H-^  =  H^. 

Note  that  as  all  space  hyperplanes  will  deal  with  1-D  arrays,  we  will  write  S  instead  of  possibly 
more  appropriate  S^. 
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3.3      The  example  continued 

VVe  now  continue  with  our  example.  As  stated  above  H  =  (1,1,1),  that  is,  the  hyperplane  is 
X  +  y  +  z  =  c.  Now  let  us  consider  a  space  hyperplane  S  =  (31,82,83)  such  that  all  the  indices 
in  a  hyperplane  s-^x  +  S2y  +  s^z  =  c  are  mapped  into  PEc  of  the  linear  array.  For  specificity  of 
explanation,  suppose  for  now  that  the  space  hyperplane  is  S  =  (1, 1,  -1).  As  shown  in  Fig.  4,  there 
are  ten  indices  on  the  time  hyperplane  (H^)  x  ^  y  +  z  =  Z:  (0,0,3)',  (0,1,2)',  (0,2,1)',  (0,3,0)', 
(1,0,2)',  (1,1,1)',  (1,2,0)',  (2,0,1)',  (2,1,0)',  and  (3,0,0)'.  Let  us  now  consider  on  which  PEs 
those  indices  are  mapped.  For  convenience,  we  will  assume  that  the  PEs  are  numbered  from  -3 
to  6,  and  not  from  1  to  M  as  formally  required  by  our  model.  Since  the  space  hyperplane  is 
S  :  X  -\-  y  —  z  =  c,  then 

(0,3.0)',(1,2,0)',(2,1,0)',  and  (3,0.0)'      will  be  mapped  to     PE-i, 

(0,2,1)',(1. 1,1)',  and  (2,0,1)'  :  P£i. 

(0,1,2)' and  (1,0,2)'  :  PjE-i, 

(0,0,3)'  :  P^_3. 

However,  we  must  prevent  (0,3,0)'.  (1,2,0)',  (2,1,0)',  and  (3,0,0)'  from  being  mapping  to  PE^ 
at  the  same  time  instance,  here  time  instance  3.  (Similarly,  for  (0,2,1)',  (1,1,1)',  and  (2,0,1)'  to 
PEi  as  weUas  for  (0,1,2)'  and  (1,0,2)'  to  P£-i.) 

Thus,  we  need  to  use  several  different  time  hyperplanes,  at  least  for  this  space  hyperplane.  The 
reader  will  wonder  at  this  point  what  would  happen  for  space  hyperplanes  other  than  our  example 
space  hyperplane  (1. 1,-1).  This  will  be  discussed  formally  later  in  the  paper. 

It  is  our  goal  to  find  a  time  hyperplane  H^  that  will  allow  assignment  of  at  most  one  index  to 
a  PE  at  any  time  instance.  In  order  to  find  such  time  hyperplane,  various  methods  might  be  used. 
We  found  the  most  expedient  to  show  how  to  derive  H^  by  modifying  Lamport's  H^.  Thus,  our 
H-"^  will  be  written  as  a  linear  combination:  H-^  =  H^  +  11,  where  11  is  referred  to  as  an  assistant 
hyperplane.  Our  notation  will  be: 

H        =      (Q1.Q2, . . .  ,ap)      is  Lamport's  hyperplane, 
II        =      (tti,  7r2, . . .  ,7rp)      is  our  assistant  hyperplane, 
H        =      H     +  n  is  our  time  hyperplane,  and 

S  =      (si,S2^- ■  ■  ^Sp)        is  our  space  hyperplane. 

We  now  continue  with  the  four  indices:  (0,3,0)',  (1,2,0)',  (2,1,0)',  and  (3,0,0)'  which  were 
mapped  into  PEj,  at  time  instance  3.  To  "spread"  them  in  time,  we  introduce  an  assistant  hyper- 
plane n  =  (7ri,7r2.  TTs)  that  cuts  the  hyperplane  H  so  that  each  time  at  most  one  of  these  four 
indices  will  be  mapped  to  ^£3.  Suppose  the  assistant  hyperplane  \s  Yi  :  x  ■\-  2z  =  d.  Then  in  the 
time  hyperplane  x  -\-  y  -\-  z  =  2, 
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(0,3,0)' 
(1,2,0)' 

(0,2,1)'  and  (2,1,0)' 
(1,1,1)'  and  (3,0,0)' 
(0,1,2)'  and  (2,0,1)' 
(1,0,2)' 
and      (0,0,3)' 


will  be  executed  at 


X  +  2z  =  0, 
J  +  22  =  1, 
ar  +  22  =  2, 
x  +  2z  =  3, 
x  +  2z  =  4, 
x  +  2z  ^b, 


X  +  2z  =  e. 

At  this  point  (0,3,0)',  (1,2,0)',  (2,1,0)',  and  (3,0,0)'  will  not  be  mapped  into  PE3  at  the  same 
time.  (Similarly,  for  (0,2,1)',  (1,1,1)',  and  (2,0,1)'  to  PE^  as  well  as  for  (0, 1,2)'  and  (1,0,2)'  to 

However,  if  H^  and  11  are  linearly  independent  and  (H  +  n)d,  >  0  for  all  data-dependence 
vectors,  then  we  can  overlap  to  execute  the  indices  according  to  H  +  11  (see  Fig.  .5).  In  fact,  if  11 
slices  H^,  H^  and  II  must  be  linearly  independent.  In  addition,  if  (H^  +  n)d,  >  0  for  all  data- 
dependence  vectors,  then  H  -1-11  still  satisfies  Theorem  1.  So  let  the  new  hyperplane  H^  =  H  -1-11 
=  (01,02,03) +  (7ri,7r2,7r3)  =  (1, 1, 1)  +  (1,0,2)  =  (2, 1,3),  that  is,  H^  ■.2x-\-y  +  dz  =  e. 

Then      (0,3,0)'  will  be  executed  at     2x  +  y -{- 3z  =  3, 


and 


(1,2.0)' 

(0,2,1)'  and  (2,1,0)' 

(1,1,1)'  and  (3,0,0)' 

(0,1,2)*  and  (2.0,1)' 

(1,0,2)' 

(0,0,3)' 


2x  +  2/  +  32  =  4, 
2x  +  y  +  3z  =  b, 
2x  +  y  +  3z  =  6, 
2x  +  y  +  Sz  =  7, 
2x+y  +  3z  ^8, 
2x  +  2/  +  32  =  9. 


Recall  that  from  the  discussion  above,  indices  (0,2,1)'  and  (2,1,0)'  are  executed  in  different 
PEs.  (Similarly,  for  (1,1,1)'  and  (3,0,0)'  as  well  as  for  (0,1,2)'  and  (2,0,1)'.)  All  the  indices 
will  be  mapped  to  the  PEs  according  to  the  new  time  hyperplane  H^  =  (2,1,3)  and  the  space 
hyperplane  S  =  ( 1, 1,  —  1)  as  in  Fig.  6. 

This  mapping  maps  index  (i,j,  A;)'  into  PfJs{i,j,fc)'  =  ^^ii+j-k)  ^.nd  executes  it  at  step 
'H.^(i,j,ky  =  {2i  -\-  j  +  3k).  We  now  describe  the  behavior  of  the  resulting  linear  array: 

1.  The  tokens  of  A  and  B  are  pipelined  and  enter  into  the  linear  array  from  left  to  right;  the 
tokens  of  C  are  pipelined  and  enter  into  the  linear  array  from  right  to  left.  The  tokens  of  A 
are  fed  into  data  link  1,  the  tokens  of  B  are  fed  into  data  link  2,  and  the  tokens  of  C  are  fed 
into  data  link  3.  (The  entrance  times  of  the  tokens  of  A,  B  and  C  wiU  be  computed  later.) 

2.  We  now  consider  the  speed  of  the  token  streams.  Formal  conditions  will  be  given  later.  For 
now  we  examine  Fig.  6.  The  tokens  of  A  flow  at  full  speed,  that  is,  there  is  no  delay  for  the 
tokens  of  A.  The  tokens  of  B  flow  at  half  speed,  that  is,  there  is  one  unit  time  delay  when  a 
token  enters  a  PE,  or  say,  there  is  a  delay  buffer  with  one  shifting  register  for  the  data  link 
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of  B.   Finally,  the  tokens  of  C  flow  at  one  third  speed,  that  is,  there  are  two  units  of  time 
delay,  or  say,  there  is  a  delay  buffer  with  two  shifting  registers  for  the  data  link  of  C. 

3.  When  the  three  tokens  A[i,k],  B[k,j]  and  C[i,j]  enter  PE(,+j_t,)  at  time  2f  +  j  +  3fc,  the  PE 
executes  the  instruction  C[iJ]  :-  C[i,j]  +  A[i,k]  +  B[k,j].        D 

In  this  example,  we  found  a  time  hyperplane  H-^  and  a  space  hyperplane  S  so  that  they  map 
the  matrix  multiplication  algorithm  to  the  linear  array.  In  general,  the  time  hyperplane  H-^  and 
the  space  hyperplane  S  must  satisfy  certain  constraints  which  we  will  state  next. 

4      On  Synthesis  of  Linear  Array  Algorithms 

In  this  section,  we  will  state  formal  necessary  and  sufficient  conditions  to  be  satisfied  by  the 
mapping  (H-^.S)  for  correct  implementation  of  p  nested  for  loop  algorithms  on  linear  arrays. 

Consider  some  p  nested  loop  algorithm.  After  labeling  ( "labeling"  was  introduced  in  Subsection 
3.1),  the  number  of  statements  in  the  loop  body  increases.  However,  all  these  additional  statements 
are  trivial  assignment  statements  (for  example,  x  :=  x\).  as  they  are  used  only  for  defining  the 
data-dependence  vectors. 

Let  DAg  =  {d\,d2 f/u)  be  the  sequence  of  the  data-dependence  vectors.  Each  data  depen- 
dence vector  is  with  a  single  specific  variable  in  V^g  ("with"  was  defined  in  Section  3)  and  each 
variable  is  also  with  at  least  one  data-dependence  vector  ("with"  is  symmetric  here).  There  may 
be  several  data-dependence  vectors  with  a  variable;  however,  no  two  variables  are  with  the  same 
data-dependence  vector.  (Thus  we  allow  c?,  =  dj  for  i  ^  j.)  We  make  a  non-essential  simplifying 
assumption.  We  assume  that  if  </,  =  (dti,d,2,  ■  ■  ■  ,d,p),  then  gcd(d,i, d,2,  ■■.  ,d,p)  —  1.  The  case 
when  gcd(d,i,d,2, . . .  ,d,p)  7^  1  can  be  handled  by  a  simple  extension. 

We  will  now  relate  the  algorithm  Ag  to  an  array  Ar.  In  our  mapping  we  will  need  w  data  links, 
as  we  will  associate  a  dedicated  data  link  with  each  data-dependence  vector.  Thus,  in  general 
A'  >  u',  and  for  simplicity  w-e  assume  that  K  =  w.  d,  will  correspond  to  some  data  link  i'  of  the 
linear-array  model.  For  simplicity,  we  assume  that  i'  =  i. 

We  now  describe  the  relation  between  the  variables  of  Ag  and  the  w  data  links.  If  a  variable 
V  £  V^g  is  with  some  number  fi(V)  data-dependence  vectors,  we  wiU  dedicate  6(V)  links  to  it. 
In  effect,  6{V)  copies  of  that  variable  will  be  "traveling"  through  the  array,  each  in  a  dedicated 
data  link.  A  variable,  in  general,  is  an  array  and  therefore  may  consist  of  many  "atomic"  entries. 
In  each  data  link  dedicated  to  the  variable,  all  those  entries  will  appear  during  the  execution  of 
the  algorithm.  Formally,  we  will  say  that  several  data  streams  (one  per  data  link)  are  associated 
with  each  variable  and  each  data  stream  consists  of  all  the  tokens  (individual  atomic  entries)  of 
the  variable. 
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To  illustrate  this,  consider  some  algorithm  in  which  the  array  ^[1..5,1..5]  is  with  three  data- 
dependence  vectors  da,  d^,  and  d^.  Then  there  are  three  data  links  dedicated  to  A.  Each  entry 
of  A,  e.g.  A[l,4]  participates  in  three  data  streams.  In  effect,  we  have  three  data  streams,  of  25 
tokens  each.  Thus,  A  gives  rise  to  75  tokens. 

We  will  now  discuss  how  the  properties  of  the  algorithm  Ag  and  the  corresponding  data  depen- 
dence vectors  D^g  influence  the  structure  and  the  behavior  of  the  linear  array  Ar  implementing 
Ag.  Consider  some  data  dependence  vector  <f,  with  some  variable  A'.  Let  x  be  any  token  of  the 
data  stream  corresponding  to  d,.  Then  based  on  the  behavior  of  Ag,  we  can  classify  d,  as  being  of 
one  of  the  three  types: 

type  1  :  d,  ^  0  and  the  token  x  is  used  (and  regenerated)  in  aU  indices  of  F  of  the  form  /  -|-  md, 
for  some  I  E  P  and  all  integers  m. 

type  2  :  (f,  7^  0  and  the  token  x  is  used  (on  the  right  hand  side  of  :  =  )  in  7  £  /^  and  is  generated 
once  only  In  I  —  d,  £  /''• 

type  3  :  d,  =  0.  Either  x  is  generated  (on  the  left  hand  side  of  :  =  )  once  only,  or  x  is  only  used 
(on  the  right  hand  side  of  :=)  but  it  is  not  generated  (or  regenerated)  in  any  index  (X  is 
an  input  variable).  (As  we  will  discuss  later,  this  implies  that  the  computed  value  of  x  will 
not  be  used  in  data  stream  i,  or  the  token  x  is  used  only  once  in  data  stream  i  but  is  not 
generated  in  any  data  stream.) 

Lemma  2    (Zero-One-Infinite)  :    The  three  types  are  exhaustive,  that  is,  no  other  case  is  possible. 
D 

In  order  to  discuss  these  three  types,  we  proceed  to  an  example  in  which  they  all  occur.  Consider 
the  following  (Longest  Common  Subsequence)  algorithm: 

Input  :  Arrays  >l[l..m]  and  B[l..n]. 

Output:  Matrix  C[l..m,l..n],  where  C[i,j]  =  the  length  of  the  longest 
common  subsequence  of  >1[1..?]  and  B[l..j]. 
for  i  =  1  to  m 

for  J  =  1  to  n 
ifA[i]=B[j] 

then  C[i,j]:=C[i- I,  j-  1]  +  1 
else  C[i,j]  :=  max{C[i,i  -  l],C[i  -  l,j]} 
end_for 
end_for 
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Here  the  algorithm  model  Ag  =  ({{i.j)'\l  <  i  <  m,  I  <  j  <  n},  {A,B,C},  (if  A[i]  =  B[j]  then 
C[i,j]  :=  C[i  -  1,  j  -  1]  +  1  else  C[i,j]  :=  max{C[i,  j  -  1],C[?  -  l,i]})).  There  are  six  variables 
(i.e.  A[i],  B[j],  C[i  -  1,  j  -  1],  [i,j  -  1],  C[i  -  1,;],  and  C[iJ])  in  the  loop  body.  After  labeling, 
the  algorithm  becomes: 


Input  :  Arrays  >l[l..m]  and  B[l..n]. 

Output:  Matrix  C[l..m,  l..n],  where  C[i,j]  =  the  length  of  the  longest 
common  subsequence  of  .4[1..2]  and  i?[l..j]. 
for  i  =  1  to  m 

for  j  =  1  to  n 

1.  A^''J^[i]  :=  /l(--^-i)[f]; 

2.  5(''J)[j]  :=  5('-i'^'[i]; 

3.  C^''j)[i-lJ-l]:=C^'-'-^-''>[i-l,j-l]; 

4.  c(''J'[(:,i  -  1]  :=  c<''J-^)[^i  -  1]; 

5.  ,      C(''^)[e-l,j]:=C(-i'^'[^-l,i]; 

7.  if /l('-^)[i]- ^''-^Hj] 

then  C^''J\iJ]  :=  C«''J)[i  -  l,i  -  1]  +  1 
else  C(''^'[«'J1  :=  max{C'*'J)[7:,i  -  l],C(''J'[i  -  l,i]} 
end_for 
end_for  * 

From  lines  1  to  6.  we  get  six  data-dependence  vectors, 

di  =  (O.iy  for(^(''J)[']-4<''J-i)[i]),  ' 

d2  =  (l,0)'for(5(''J)[j],5(-i'^)[i]), 

d3  =  11,1)'  for  (C(''^'['  -  l.J  -  l],C('-i'^-^'[i  -  1,  j  -  1]), 

d,  =  (0,1)'  for  (C(■•^)[^,J  -  l],C('-^-i)[i,i  -  1]), 

ds  =  (1.0)'  for  (C'''^'[?-  l,i],C('-i'j)[?:-  1,;]),  and 

d6  =  (0.0)'for(C('-^'[ni].C("'^>[/,i]). 

di  and  0^2  are  of  type  1,  da,  ^4,  and  c/5  are  of  type  2,  and  dg  is  of  type  3. 

If  a  token  is  with  a  type  1  data-dependence  vector,  then  this  token  is  needed  throughout  the 
execution.  In  our  example,  both  A[i]  (with  d^)  and  B[j]  (with  ^2)  are  such  tokens. 
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If  a  token  is  with  a  type  2  data-dependence  vector,  then  this  token  is  not  needed  after  it  is 
used  once.  This  observation  points  out  that  a  token  with  type  2  data-dependence  vector  will  not 
be  an  input  token  or  an  output  token.  For  example,  C[i  —  l,j  —  1],  which  was  generated  in  index 
{i  -  l,j  -  1)',  is  with  ^3.  Thus,  C[i  -  l,j  -  l]  can  be  destroyed  after  it  was  used  in  index  (J,j)'. 
Similarly,  C[i,j  —  1],  which  was  generated  in  index  (i,  j  -  1)',  is  with  ^4  and  can  be  destroyed  after 
it  was  used  in  index  (i,i)*.  Finally,  C[i  -  l,j],  which  was  generated  in  index  {i  -  l,j)',  is  with  ds 
and  can  be  destroyed  after  it  was  used  in  index  (i,  j)*. 

As  we  excluded  the  case  when  Sd,  =  0  in  this  paper,  we  will  not  consider  the  case  when  d,  =  0 
in  the  rest  of  the  paper.  For  completeness  sake,  however,  we  will  present  a  very  brief  discussion.  If 
a  token  is  with  a  type  3  data-dependence  vector,  its  behavior  is  more  complex.  It  is  generated  (in 
the  data  stream  corresponding  to  d, )  but  never  used  again  in  that  data  stream.  Its  value,  however, 
is  not  lost  but  must  be  copied  as  input  value  for  other  data  streams.  For  example,  C[i,  j]  is  copied 
to  data  links  3,  4,  and  5  (corresponding  to  data  streams  3,  4,  and  5). 

If  some  d,  =  0,  in  the  resulting  array  each  PE  will  need  an  I/O  port  for  transferring  tokens 
between  the  host  computer.  Thus,  the  total  number  of  I/O  ports  will  not  be  constant.  As  stated 
in  the  introduction,  in  this  paper  we  are  interested  in  linear  arrays  with  a  constant  number  of  I/O 
ports. 

We  now  discuss  an  important  property  of  data  streams.  Each  data  stream  of  variable  X  with  a 
specific  data-dependence  vector  d,  has  only  one  token  (entry  of  A")  used  (and  regenerated)  in  each 
index.  Therefore,  each  assignment  statement  in  the  loop  body  can  be  seen  as  a  u'-ary  function.  For 
example,  r  =  FAg(vi,V2, .. .  ,i\,),  where  pAg  is  a  statement  in  FAg  and  r,  is  an  entry  of  variable 
V,,  which  is  with  d,. 

Since  all  the  statements  in  F^g  can  be  handled  in  the  same  way,  in  the  following  we  only  consider 
a  single,  representative,  statement  ^4^  instead  of  a  sequence  of  statements  FAg-  For  simplicity,  we 
assume  that  only  one  time  unit  is  needed  to  execute  the  whole  statements  in  the  loop  body. 

A  correct  linear  array  algorithm  (H^,S)  that  maps  a  p  nested  loop  algorithm  Ag  into  a  linear 
array  Ar  must  preserve  data-dependence  relations,  the  right  tokens  must  be  in  the  right  place  at 
the  right  time,  and  in  addition,  data  tokens  must  not  collide  in  data  links. 

It  is  our  goal  to  find  necessary  and  suflRcient  conditions  on  synthesizing  such  linear  array 
algorithms  (H^,S).  We  are  also  interested  in  the  time  complexity  and  the  storage  complexity 
for  (H-'^,S).    Furthermore,  for  a  special  class  of  algorithms,  whose  set  of  the  data-dependence 

vectors  is  {^1,^2,  •  •  •  ,f/u}  =  {(1,0 0)',  (0. 1, . . .  ,0)',  . . .,  (0,0 1)'},  we  are  interested  in 

tight  bounds  on  time  complexity  of  their  linear  array  implementations.  This  class  of  algorithms  is 
of  particular  interest,  as  it  includes  our  example  —  matrix  multiplication,  and  algorithms  to  solve  L- 
U  decomposition,  inversion  of  nonsingularly  triangular  matrix,  matrix  orthogonal  triangularization, 
a  version  of  transitive  closure  algorithm  [7],  DFT,  bubble  sort  [21],  and  others. 
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4.1      Necessary  Conditions 

We  now  consider  conditions  to  be  satisfied  by  (H^,S)  in  order  to  assure  a  correct  construction. 
There  are  five  necessary  conditions  for  (H   ,S). 

1.  H^  has  to  preserve  the  data-dependence  relation,  that  is,  if  h  -  h  -  d,  for  any  two  indices 
/i  and/2  then/2  must  be  executed  after /i.  That  is,  Hl/2-Hl/i  >  0,  or,  H1(/2-/i)  =  H^d,  >  0. 

Condition  1  :  H^cf,  >  0  must  bt  true  for  all  data- dependence  vectors  d,. 

Note:  This  follows  immediately  from  Lamport's  result  [20]. 

2.  No  two  indices  /i  and  I2  can  be  mapped  to  the  same  PE  at  the  same  time,  that  is,  H  /i  =  H  /2 
and  S/i  =  S/2  can  not  be  both  true  at  the  same  time. 

Condition  2  :  ///i  and  I2  ore  two  indices  of  P  then 

"')"-'■"* (2)-  '""''•  ("s  )<'--"'^'''^(S 

for  {Ij  -  Uj)  <  X J  <  (Uj  —  Ij)  and  j  =  1, . . . , p. 

Note:  This  Condition  is  related  to  the  nonsingularity  condition  of  Kuhn  [12]  and  Moldovan 
[28].  Varman  and  Ramakrishnan  [40]  also  gave  a  special  case  of  Condition  2  to  construct  an  array 
for  matrix  multiplication. 

3.  Next,  suppose  a  variable  with  the  data-dependence  vector  d,  is  generated  in  index  /  and  will 
be  used  next  time  in  index  I  +  d,.  Index  I  is  executed  in  PE^j  at  time  H^I.  and  the  index  /-+-  di 
will  be  executed  in  /'/^S(7+d,)  ^*  time  H^(7-)-  c?,).    Thus  the  corresponding  token  is  in  PE^j  at 

time  H^7  and  is  in  PEs(i+d,)  at  time  H^{!  +  d,).  Therefore,  "s{i^^;}ls;  ^  =  ^^  must  be  an 

integer,  as  this  token  is  delayed  by  a  constant  amount  of  time  in  each  PE.  (Note:  if  Sd,  >  0  then 

g^'^'  is  positive,  and  that  token  is  delayed  by  ^f'  time.   If  Sd,  <  0  then    ^f'  is  negative,  and 

Til    J 

that  token  is  delayed  by  — gj-^  time. ) 

We  now  examine  the  data  flow  behavior  of  data  stream  i.  Consider  a  token  with  data  depen- 
dence vector  d,.  From  the  discussion  above,  it  follows  that  it  wiU  be  delayed  for  |  g^  '  |  time  while 
travelling  through  one  PE.  One  time  unit  is  allocated  to  the  processing  time  and  therefore  we  have 
to  account  for  the  remaining  |  g^  '  |  -  1  time  units.  To  accomplish  that,  we  need  |  g^  '  |  —  1  shifting 
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registers  in  the  data  link  i. 

Condition  3  :  6,,  the  number  of  shifting  registers  in  each  PE  for  the  data  link  i,  must  be  |  gj  '  |~^- 

Sometimes,  when  we  wish  to  refer  directly  to  a  condition  on  the  value  of  g^  ' ,  we  may  use 
Condition  3': 

Condition  3'  :   ^/'   must  be  an  integer  for  all  data-dependence  vectors  d,. 

4.  Now,  define  the  positive  direction  of  the  linear  array  to  be  from  left  to  right  and  the  negative 
direction  to  be  from  right  to  left.  Then,  a  token  with  data-dependence  vector  di  that  is  generated 
in  index  I  is  executed  in  PE^j  at  time  H^J  and  will  be  used  in  index  I  +  d,  in  PE^,j_^_^  ^  at  time 
Hi(7  +  d,).  Since  U^(I  +  d,)  >  H^/,  if  Sd,  >  0  then  P£s,;+j_)  >  PEgj  and  the  token  wiU  flow 
from  PEgj  to  PjEs(7+d, )  '^^  ^^^^  positive  direction,  and  if  Srf,  <  0  then  PE^,j^_j',  <  PE^j  and  the 
token  will  flow  from  PE^j  to  P-Es(/+d,)  ^'^  ^^^^  negative  direction.  Let  t,  =  1  denote  that  the  data 
stream  i  has  the  positive  direction,  and  let  t,  —  —I  denote  that  the  data  stream  i  has  the  negative 
direction.  Then  we  restate  the  above  as  the  condition: 

Condition  4  :  If  Sd,  >  0,  then  t,  —  1  and  the  data  stream  i  will  be  fed  into  the  linear  array  at 
P^miniSlilhel'']-  If^<ii  <  0)  "*c"  ^1  =  ~1  (^^d  the  data  stream  i  will  be  fed  into  the  linear  array 
«^  -P-E'max{S/2|/2e/P}- 

Note:  (1)  Ramakrishnan  et  al.  [33]  also  gave  a  similar  condition  to  Condition  4,  but  they  did 
not  base  it  on  the  data-dependence  vector  d,.  (2)  If  Sd,  =  0  then  t,  —  0  and  the  data  stream  will 
be  fixed  in  the  PEs.  However,  as  stated  above,  we  do  not  consider  Sd,  =  0  in  this  paper. 

Lemma  3  :   Let  (H^,S)  satisfy  Condition  3  and  Condition  4  and  let  token  x  be  with  d,.   If  x  is 


in  PEa  at  time  Ta  and  will  be  in  PEb  at  time  Tb,  then  ^_^°  =  t,{b,  +  1)  =  -g^ 
Proof  :  Immediately  from  Conditions  3  and  4.        D 


Corollary  4  :  Let  (H^,S)  satisfy  Condition  3  and  Condition  4  and  let  token  x  be  with  d,.  If  x  is 
used  in  index  I  in  PE^j  at  time  H^ I  and  will  be  used  in  index  I  +  d,,  then  x  will  be  in  PEgrj^^^ 
at  time  H^(7+  rf,). 
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Proof  :  Immediately  from  Lemma  3.        D 

We  can  now  derive  the  entrance  time  of  each  data  token. 

Corollary  5   :    Suppose  token  x,   whose  data-dependence  vector  is  d,  of  type  1,   arrives  at  some 

PEa  at  time  Ta- 

IfSd,  >  0,  then  x  is  fed  into  the  linear  array  into  ■P.E'min{S7|76/p}  ^^  time 

ttI  J 

r„-(a-min{S7|/e/P})-^. 
IfSd,  <  0,  then  x  is  fed  into  •P-E'niax{S/|/€/p}  °^  time 

ra-(max{S7|7G/P}-a)-g^. 
Proof  :  From  Lemma  3,  by  linear  interpolation.        D 

Corollary  5  shows  that  once  we  know  that  x  enters  some  PE  at  some  time,  then  we  can  obtain 
the  entrance  time  of  x.  On  the  other  hand,  if  x  enters  the  linear  array  at  the  specific  entrance 
time,  then  x  will  be  in  that  PE  at  that  time. 

5.  It  may  seem  that  we  have  identified  all  "major"  necessary  conditions.  This,  however,  is  not  the 
case.  So  far  we  have  only  examined  the  behavior  of  individual  tokens.  Now  we  will  consider  the 
possible  interference  between  tokens  of  the  same  data  stream. 

Let  x\  and  X2  be  tokens  of  the  variable  A'  in  the  data  stream  corresponding  to  d,.  (Thus,  of 
course,  d,  is  with  A'.)  x-y  is  used  in  index  I\  and  X2  is  used  in  index  l2-  Conceivably  X\  =  X2-  To 
characterize  when  x\  ^  X2-  we  have  two  lemmas,  depending  on  the  type  of  d,.  As  the  case  where 
it  is  of  type  2  is  trivial,  we  deal  with  it  first. 

Lemma  6  :  Let  xi  and  X2  be  tokens  of  variable  X  in  the  data  stream  corresponding  to  di,  which 
is  of  type  2.  Let  ij  be  used  in  some  index  I\  and  let  X2  be  used  in  some  index  I2.  Then,  Xi  ^  X2 
if  and  only  if  I\  ^  l2- 

Proof  :  Immediately  from  the  fact  that  each  token  is  used  in  one  index  only,  and  no  two  tokens  of 
any  data  stream  are  used  in  the  same  index.        D 

Lemma  7  :  Let  x\  and  X2  be  tokens  of  variable  X  in  the  data  stream  corresponding  to  d,,  which 
is  of  type  1.  Let  X\  be  used  in  some  index  I\  and  let  X2  be  used  in  some  index  I2.  Then,  Xi  ^  X2 
if  and  only  if  I2  —  I\  is  not  an  integer  multiple  of  d,. 
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Proof : 

If  part:  We  want  to  show  that  if  Xi  =  X2  then  h  -  h  -  rndi  for  some  integer  m.  Assume  that 
/o  is  the  first  index  in  which  token  xi  (X2)  is  used  (and  is  also  regenerated).  Then  /q  <  I\  and 
Iq  <  I2  because  of  lexicographical  ordering.  From  the  definition  of  the  data-dependence  vector, 
/i  =  7o  +  mid,  and  I2  =  Iq  +  m2di  for  some  non-negative  integers  mi  and  m2-  Therefore,  I2  -  h 
—  (m2  —  Tni)d,  =  md,  for  some  integer  m. 

Only  if  part:  We  want  to  show  that  if  xi  7^  X2  then  I2  —  I\  i^  mdi  for  all  integers  m.  Let  1\  be 
the  first  index  in  which  token  xi  is  used  (and  is  also  regenerated)  and  let  I2  be  the  first  index  in 
which  token  X2  is  used  (and  is  also  regenerated).  Then  xi  will  be  used  (and  will  be  regenerated) 
in  all  indices  of  the  form  /i  +  m\d^  £  /p  and  X2  will  be  used  (and  will  be  regenerated)  in  all  indices 
of  the  form  I2  +  m2rf,  €  /''. 

Let  us  examine  whether  we  could  have  I2  =  Ii  +  rhitf,  for  some  integer  mi.  If  this  equality 
holds,  then  both  xi  and  X2  are  used  in  the  index  I2  =  I\  -\-  rhidi.  However,  as  stated  at  the 
beginning  of  Section  4,  at  most  one  token  of  any  data  stream  can  be  used  in  an  index.  Thus, 
{I2  —  I\)  ^  "'t/i  for  all  integers  m. 

Then,  from  the  definition  of  data-dependence  vector,  7i  =  /i  -|-  Tn\di  and  I2  =  I2  -\-  m2di  for 
some  non-negative  integers  mi  and  m2.  Therefore,  I2  —  Ii  =  {I2  —  A)+  ("^2  ~  mi)dt  /  md,  for  all 
integers  m.        D 

We  will  examine  the  case  where  (I2  —  Ii)  ^  'mdi  for  all  integers  m.  Then  we  will  show  that  we 
cannot  have  H^(72  —  I\)Sd,  =  S(/2  —  /i)H  d,,  as  otherwise  collisions  would  occur  in  data  links. 
As  this  is  rather  non-intuitive,  we  start  with  an  example. 

Let  H-^  =  (2,1,2)  and  S  =  (1.1,-2)  for  the  matrix  multiplication  algorithm.  This  map- 
ping (H  ,S)  satisfies  Conditions  1  through  4  and  the  resulting  behavior  is  described  in  Fig-7. 
Observe  that  C[0,3]  collides  with  C[2,0]  and  C[l,3]  collides  with  C[3,0]  because  H^((2,0.0)*  - 
(0,3,0)')S(0,0,1)'  =  S((2,0,0)'-(0,3.0)')Hl(0.0,l)'  =  -2  and  (2,0,0)'  -  (0,3,0)'  /  m(0,0,l)' 
as  weU  as  H1((3,0,0)'  -  (1,3,0)')S(0, 0, 1)'  =  S((3.0.0)'  -  (l,3,0)')Hl(0, 0, 1)'  =  -2  and 
(3,0,0)'  -  (1,3,0)'  ^  m(0,0,l)'  for  any  m.  To  prevent  this,  we  have  the  condition: 

Condition  5  :  If  {I2  -  h)  7^  md,  for  all  integers  m,  then  H^(/2  -  I\)Sd,  /  S(/2  -  Ii)B.^di. 
Lemma  8  :  Condition  5  is  necessary. 
Proof  :  We  consider  three  cases: 
1.  H^/i  =Hl/2. 
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From  Condition  2,  it  follows  that  S/j  ^  S/j.    U^ih  -  Ii)Sd,  =  (U'^h  -  ll^h)Sd,  =  0; 
however,  from  Condition  1,  H^d,  >  0  and  therefore  S(/2  -  /OH^d,  =  (S/2  -  S/OH^d,  yt  0. 

2.  S/i  =  S/2. 

From  Condition  2,  it  follows  that  H^/2  ^  H^/i.  As  we  assume  that  Sd;  7<^  0,  H^(/2-/i)Sd,  7^ 
0.  However,  as  S/2  =  S/j,  S(/2  -  /i)H^d,  =  0. 

3.  H^/i  7^  HV2  and  S/i  7^  S/2. 

Let  xi  and  a-2  be  two  tokens  of  the  variable  A'  corresponding  to  rf,,  such  that  x^  is  the  token 
used  in  /j  and  X2  is  the  token  used  in  h-  Consider  two  subcases: 

(a)  d,  is  a  type  1  data-dependence  vector. 

Since  h  -  h  ^  "^^^i  for  all  integers  m,  from  Lemma  7,  xi  /  X2-  In  addition,  Xi  is  in 
P^S/i  at  time  H-^/i  and  X2  is  in  PE^i^  at  time  H^/2. 

Without  loss  of  generality,  let  H-^/i  <  H-^/2.  Let  PEb  be  the  PE  in  which  xi  is 
located  at  time  H^/2.  Then  by  Lemma  3,  (H^/2  -  H^/i)Sd,  =  (6  -  S/i)H^d,.  Assume 
by  contradiction  that  H^ih  -  Ii)Sd,  =  S(/2  -  h)H^d,.  Then,  S(/2  -  h)ll^di  = 
(b  -  SIi)}l^d,.  From  here,  using  H^d,  7^  0,  6  =  S/2.  Thus,  the  tokens  Xi  and  X2  both 
use  data  link  /'  and  appear  in  PE^j^  at  time  H-^/2.  It  is  "data  collision,"  which  is  not 
aUowed.  Therefore,  H^(/2  -  h)Sd,  7^  S(/2  -  h)li^d,.     , 

(b)  df  is  a  type  2  data-dependence  vector. 

Let  Xk  be  a  token  of  rf,  generated  in  7^  -  d,  e  /''  and  used  in  7^  €  P-  It  is  used  once 
only  and  therefore  its  value  can  be  destroyed  afterwards.  However,  we  will  view  Xk  in 
a  natural  way  as  being  a  member  of  a  certain  sequence  of  tokens.  More  specifically,  let 
i^,  =  max{i|7;,  -  jd,  e  /P},  7/'"'  =  Ik-  jr,d„  and  let  7{""'  be  the  last  index  in  P  of 
the  form  7/"^^'  -f  md,. 

There  is  a  sequence  of  tokens  naturally  associated  with  the  sequence  of  indices  7^"^*  , 
7^'"'  -\-  d,.  . . . ,  7["*'.  We  denote  this  sequence  by  xl.  Observe  that  the  xl  is  analogous 
to  a  single  token  of  a  data  stream  with  type  1  data-dependence  vector.  Perform  the 
above  for  our  tokens  xi  and  X2  obtaining  sequences  of  tokens  x\  and  X2-  The  rest  of  the 
proof  proceeds  similarly  to  the  second  paragraph  of  the  case  3(a)  replacing  xi  there  by 
our  x\  and  replacing  X2  there  by  our  x'2- 
D 


An  immediate  Corollary,  which  we  wiU  sometimes  find  useful,  is: 
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Condition  5'  :  If  d,  ^  mdj  and  dj  ^  md,  for  all  integers  m,  then    g^  '   7^     g^  ■' . 

We  summarize  the  preceding  discussion  in: 

Theorenn  9  :  A  linear-array  algorithm  (H^,S)  that  maps  correctly  a  p  nested  loop  algorithm  Ag 
into  a  linear  array  Ar  must  satisfy  Condition  1  through  Condition  5.        D 

4.2      Sufficient  Conditions 

By  assuming  Condition  1  througli  Condition  5.  we  will  show  that  (H^,S)  will  preserve  the  data- 
dependence  relation,  the  right  tokens  will  be  in  the  right  place  at  the  right  time,  and  in  addition, 
no  data  tokens  will  collide  in  data  links.  Therefore,  Condition  1  through  Condition  5  are  not  only 
necessary,  but  also  sufficient. 

Theorem  10  :  A  linear-array  algorithm  (H'^,S)  from  a  p  nested  loop  algorithm  that  satisfies 
Condition  1  through  Condition  5  maps  that  p  nested  loop  algorithm  Ag  correctly  into  a  linear  array 
Ar. 

Proof  :  Formally,  the  PEs  in  the  array  should  be  numbered  from  1  to  A/,  and  S  should  map  the 
indices  into  {PEi, . . . ,  PEf,j].  However,  it  will  be  convenient  to  continue  numbering  the  PEs  from 
min{S/i|/i  G  /^}  to  max{S/2|/2  €  P)  in  this  proof. 

First,  from  Condition  1,  H^  preserves  the  data-dependence  ordering. 

Second,  we  show  that  the  "needed"  tokens  will  flow  to  the  right  place  at  the  right  time,  in  ad- 
dition, Fat  =  Faq  C'FAg"  was  introduced  at  the  beginning  of  Section  4).  Because  there  are  A'  =  w 
data-dependence  vectors,  we  can  let  pAg  be  the  A'-ary  functions  executed  at  each  loop  index  7  of  7^. 
We  will  only  consider  assignment  statements,  as  other  types  of  statements  can  be  handled  by  trivial 
modifications.  Consider  then,  a  typical  assignment  statement  r  :=  FAg{v-i,V2. .  ■ .  ,vk),  where  v, 
is  the  token  (variable)  with  the  data-dependence  %'ector  d,  which  was  generated  in  7,  (7,  G  /''). 
We  will  now  show  by  induction  on  H^7  that  all  tokens  u,  arriving  at  PEj  at  time  H^7  will  have 
correct  values. 

Basic  step:  When  H^7  =  min{H^/i|/i  £  P},  all  tokens  u,  are  initial  input/output  tokens  (vari- 
ables). From  Conditions  3  and  4  or  Corollary  5,  v,  will  rearch  PE^t  at  time  H^7.  Thus,  Fat 
performs  the  same  function  as  pAg  in  7. 

Hypothesis  step:   Before  the  time  H^7,  suppose  that  under  (H^,S)  pAr  has  generated  the  same 
values  as  /Ug. 
Induction  step:  At  time  H^7.  from  the  definition  of  the  data-dependence  vectors  we  have 

I=h+d,=l2  +  d2  =  ...=  lK  +  dK,  (1) 
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for  appropriate  7i.72, . . .  ,/a'.  From  the  hypothesis  step,  Vi  was  regenerated  with  correct  value  in 
PE^j  at  time  H-^/,.  However,  v,  was  not  used  (and  was  not  modified)  between  the  time  instances 
H-^7,  and  H-^(7,  +  d,)  (not  including  H-^7,  and  H-^(7,  +  d,)).  Then  from  Conditions  3  and  4  or 
Corollary  4.  i',  will  reach  P-Es/.+Sd,  at  time  H^7,  +  H^c?,.  But  from  (1),  S7  =  S(7,  +  d,)  and 
H^7  =  H^(7,  +  d,)\  therefore,  all  of  the  tokens  v-i,V2,.  ■  ■  ,ua'  arriving  at  PE^j  at  time  H-^7  will 
have  correct  values.  Fat  will  thus  perform  the  same  function  as  F^^  in  7. 

Third,  we  will  show  that  no  two  tokens  (variables)  collide  in  any  data  link.  Assume  by  contra- 
diction that  data  link  i  has  two  distinct  tokens  xi  and  X2  of  the  variable  A'  which  collide  in  PEa 
during  the  execution.  (Thus,  of  course,  d,  is  with  A'.)  Suppose  then  that  xj  is  then  immediately 
used  in  index  /i  in  -P-Es/i  ^^  time  H  /]  and  X2  is  then  immediately  used  in  index  I2  in  PE^j  at 
time  H-'^/j.  (If  Xi  or  X2  will  not  be  used  again,  then  Xi  was  just  generated  in  index  /i  or  X2  was 
just  generated  in  index  l2-)  Consider  two  cases: 

1.  HVi    =HV2. 

Our  assumption  is  that  at  some  time  instance  x\  and  xj  collide  in  PEa,  and  then  x\  is  in 
P£s/i  at  time  H  /i  =  H  /2  and  X2  is  in  PPg/j  at  time  H  I2  —  H  /i.  However,  since 
all  tokens  of  a  data  stream  flow  at  the  same  speed,  we  have  PEqj^  =  PE^j^  contradicting 
Condition  2. 

2.  H^/i  ^  H'^h. 

Without  loss  of  generality,  H'^/j  <  H^/2.  Since  x^  and  X2  flow  at  the  same  speed,  X2  will 
flow  from  PEa  to  PE^j^  at  time  H^/i  and  then  to  PE^j^  at  time  H'^/2.  As  all  tokens  flow 
with  non-zero  velocity.  SI\  /  S/2.  Consider  two  subcases: 

(a)  I2  —  I\  i^  rndi  for  all  integers  m. 

From  Lemma  3.  token  X2  is  in  PEsi^  at  time  H^/j,  is  in  PP^s/j  at  time  H^/2»  and 
therefore      g/^Ig;  ^'  =    g/'  •  However,  this  contradicts  Condition  5. 

(b)  I2  -  h  —  T^tdi  for  some  integer  m. 
Consider  two  cases: 

i.  d,  is  a  type  1  data-dependence  vector. 

From  Lemma  7,  Xx  =  X2-  However,  this  contradicts  our  assumption  that  X\  ^  X2. 
ii.  <f,  is  a  type  2  data-dependence  vector. 

Since  X2  is  used  in  I2,  X2  is  generated  only  in  l2  —  di.  However,  since  X2  was  in  PE^j^ 
and  I2  /  /i  and  I2  -  I\  =  md,  for  some  integer  m,  I^  —  I2  —  d,  and  PEa  —  PE^j^. 
From  the  definition  of  type  2  data-dependence  vector,  x\  will  not  be  used  again 
after  it  is  used  in  I\.  Therefore,  when  X2  is  generated  in  PEs(j^_j^)  =  PEgj^,  xi  is 
destroyed.  Thus,  no  collisions  will  occur. 
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Finally,  from  Theorem  1,  (H^,S)  can  be  performed  in  parallel.        D 

4.3      Storage  and  Time  Complexity 

In  this  subsection,  we  study  the  storage  complexity  and  the  time  complexity  of  the  mapping  (H^ ,  S). 

Theorem  11    : 

1.  Total  number  of  PEs  is  M  =  max{|S(/2  -  h)\\  hJi  ^  I^]  +  1- 

2.  Total  execution  time  is  T  =  C)(max{|H^(/2  -  /i)|  |  luh  €  /''}  +  A')\ 

where  N  =  M(1  +  E£i^)- 

Proof : 

1.  Follows  immediately  from  the  structure  of  the  mapping. 

2.  The  time  complexity,  T,  of  the  linear  array  implementation  is  the  time  elapsed  between  the 

instance  the  first  token  is  input  into  the  array  and  the  instance  the  last  token  is  output  from 
the  array.  Let  T'  be  the  time  elapsed  between  the  instance  the  first  index  is  executed  and 
the  instance  the, last  index  is  executed.  Both  the  time  between  the  instance  the  first  token 
is  input  into  the  array  and  the  instance  the  first  index  is  executed  and  the  time  between  the 
instance  the  last  index  is  executed  and  the  instance  the  last  token  is  output  from  the  array 
are  less  equal  to  il/(max{6,|l  <  i  <  K]  +  1).  Thus  T  <  T'  +  2i\/(max{6,!l  <  i  <  K]  +  1). 

As  T'  =  max{lH^/2  -  /i)|  |  luh  €  I"}  and  M(max{6,|l  <  i  <  K)  +  1)  <  N ,  the  proof 
follows  immediately.      D 


Until  now,  in  the  linear  array,  PEs  were  numbered  from  min{S/i|7i  G  P]  to  max{S/2|/2  G  P}. 
Using  M  from  Theorem  11,  we  can  renumber  PEs  from  1,  2,  ....  A/  as  is  formally  required.  We 
will  say  that  (H-^,S)  uses  an  N -storage  linear  array,  where  N  =  M(l  +  X^^j  6,),  because  N  is  the 
sum  of  the  total  number  of  the  PEs  and  the  local  registers. 


Corollary  12   :   Total  execution  time  is  T  =  il{N). 

Let  /  and  g  be  two  functions  of  argument  i.  f  =  0{g)  if  f  <  eg  for  some  constant  c  and  for  sufficiently  large  x. 
f  =  Q(g)  a  f  >  eg  for  some  constant  c  and  for  sufficiently  large  x.  f  =  Q(g)  if  /  =  0(g)  and  /  =  n(s).  /  =  o(g)  if 
fig  — •  0  as  z  — •  oo. 
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Proof  :  Since  there  is  at  least  one  data  token  passing  through  data  link  ?  for  1  <  i  <  A'  and  since 
K  is  independent  of  the  problem  size  n  and 

JV/(max{6,ll  <  i  <  K}  +  1)  <  A'  <  A'M(max{6,ll  <  i  <  K}  +  1), 

the  proof  follows  immediately.        D 

We  now  want  to  consider  the  case,  where  the  set  of  the  data-dependence  vectors  is  {di,d2,  ■ . .  ,duj} 
=  {(1,0, .. .  ,0)',  (0, 1, . . .  ,0)',  . . .,  (0,0, ... ,  1)*}.  As  mentioned  above  such  algorithms  include  al- 
gorithms to  solve  matrix  multiplication,  L-U  decomposition,  inversion  of  nonsingularly  triangular 
matrix  [9],  matrix  orthogonal  triangularization,  a  version  of  transitive  closure  algorithm  [7],  DFT, 
bubble  sort  [21],  and  others. 

We  are  interested  in  tight  bounds  on  time  complexity  of  linear  array  implementations  of  these 
algorithms. 

Theorem  13  :  If{du  ^2,  ...,  d^.}  =  {(1,0 0)',  (0, 1, . . .  ,0)',  ...,  (0,0,...,!)'}  is  the  set  of 

data-dependence  vectors  in  D^g,  then  the  execution  time  is  T  =  Q{N). 

Proof  :  There  are  exactly  p  distinct  data-dependence  vectors  in  the  set  {d\,d2. .  ■ .  ,dy^,},  so  we 
assume,  without  loss  of  generality,  that  K  =  w  =  p  and  di  =  (1,0, .. .  ,0)',  d2  =  (0, 1, . . .  ,0)',  . . . , 
and  dp  =  (0,0. . . . ,  1)'.  From  Condition  1,  (/i,/2, . . .  ,/p)'  must  be  the  first  index  to  be  executed 
and  (ui,  U2, . . . ,  Up)'  must  be  the  last  index  to  be  executed,  because  H^d^   >  0,  H^rf2  >  0,  ..., 

and  H^t/p  >  0.    In  addition,  (ui,U2 UpY  -  (I1J2 ,lpY  -  (ui  -  l-i)di  +  (U2  -  l2)d2  +  ■•• 

+  [Up  -  lp)dp.  Since  both  (/1./2 ,lp)'  and  {li,l2,  ■  ■  ■  JpY  +  (u,  -  l,)d,  will  be  mapped  into  the 

linear  array,  (u,  -  l:)\Sd,\  <  M.  Let  T'  =  H^{uuU2, . . .  ,UpY  -  H'^il^h,  ■  ■  ■  JpY ■  then, 

T'     =  EfLi(", -/JHlrf,  (because  A-  =  p) 

=  E^=i(«< -/,)|Scf,|(6,  + 1)  (fromLemma3) 

<  Ef=i  M(b,  +  1)  (because  (u,  -  U)\Sd,\  <  M) 

<  MiElUb.  +K). 

Since  A'  is  independent  of  the  problem  size  7?,  T'  =  0{N).  In  addition,  T  <  T'-)-l-|-2M(max{6i|l  < 
7  <  A'}  -I-  1)  =  O(A^).  Finally,  from  Corollary  12  we  have  T  =  0(7V).        D 

Theorem  13  shows  that  the  tight  bound  on  time  complexity  of  this  class  of  algorithms  is  the 
same  as  the  storage  complexity.  We  can  use  this  rather  general  result  to  show  that  some  mappings 
have  both  the  optimal  time  complexity  and  optimal  storage  complexity. 
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As  an  example,  consider  the  nxn  matrix  multiplication  algorithm  Ag  =  ({{i,j,ky\0  <  i,j,k,  < 
n-1},  {A,B,C},  iC[i,j]  :=  C[iJ]  +  A[i,k]*B[k,j])).  In  Section  3  (and  also  in  [31]),  a  linear  array 
algorithm  with  H^  =  (2,l,n-  1)  and  S  =  (1,1,-1)  was  synthesized  for  this  algorithm.  (Actually, 
Section  3  dealt  with  n  —  4,  but  the  extension  to  general  n  is  trivial.)  In  this  mapping,  the  total 
number  of  PEs  is  (l,l,-l)((7i-  l,n  -  l,0)'-(0,0,n-  1)')  =  3n-2.  In  addition,  there  is  one 
register  for  A  (6i  =  1),  no  registers  for  B  (62  =  0),  and  n  -2  registers  for  C  (63  =  n  -  2).  Thus,  the 
total  storage  complexity  of  the  linear  array  is  A'  =  (3n  -  2)  E?=i(^.  +  1)  =  (3n  -  2)(2  +  1  +  n  -  1) 
=  Q{n^).  As  by  Theorem  13,  T  =  0(A),  we  see  that  T  =  Q(n^). 

We  now  show  that  this  linear  array  is  time  optimal.  As  there  are  3n^  tokens  to  be  read  and 
there  is  only  a  constant  number  (three)  of  Input  ports,  the  input  time  required  is  Q{n'^),  and  thus 
the  algorithm's  time  complexity  is  optimal. 

We  will  now  show  that  the  storage  complexity  must  be  ri(n^),  thus  proving  that  the  array  is 
storage  optimal.  Assume  by  contradiction  that  for  some  array  implementing  matrix  multiplication 
N  =  o{n^).  Then  by  Theorem  13,  T  =  o(n^)  contradicting  the  fact  that  T  =  Q(n^).  Thus  the 
algorithm  above  was  optimal.  Note:  Ramakrishnan  and  Varman  [32]  also  obtained  this  bound  by 
reducing  matrix  multiplication  to  a  game  played  with  tokens  on  an  undirected  graph. 

As  we  wiU  see  later,  all  the  linear  array  algorithms  for  matrix  multiplication  in  Sections  5  and 

6  have  both  the  optimal  time  complexity  and  the  optimal  storage  complexity.  The  storage  can  be 
distributed  in  between  0(r?)  and  0{n^)  PEs  depending  on  additional  objectives  of  the  designer. 

5      The  trade-offs 

In  this  section,  we  consider  the  trade-offs  between  the  time,  the  number  of  PEs  and  the  number  of 
local  registers  in  each  PE. 

In  Subsection  3.3,  we  showed  how  to  find  the  time  hyperplane  H^  —  (2,1,3)  when  given  a 
space  mapping  S  =  (1,1,-1)  for  matrix  multiplication.  Furthermore,  in  Section  4  we  specified 
conditions  to  be  satisfied  in  deriving  a  time  hyperplane  H^  for  a  given  space  mapping  S.  We  now 
want  to  consider  explicitly  the  case  where  we  are  restricted  to  designing  the  linear  array  with  a 
specific  type  of  PE  provided  to  us  in  order  to  avoid  custom  design  of  the  PEs.  Alternatively,  for 
fabrication  reasons,  we  may  be  restricted  to  designing  PEs  with  /  data  links  of  positive  direction 
and  r  data  links  of  negative  direction  for  specific  /  and  r.  To  further  elaborate,  it  is  of  importance 
to  consider  such  restrictions  because  we  may  have  only  one  type  of  PE  available,  or  for  reasons  of 
mass  production  we  wish  to  produce  only  a  smaU  number  of  types  of  PEs  but  still  should  be  able 
to  use  them  to  design  arrays  solving  a  variety  of  problems.  Such  restrictions  give  rise  to  additional 
constraints  on  the  design  which  we  will  now  consider. 

We  continue  with  the  example  of  matrix  multiplication.  Earlier,  we  derived  an  implementation 
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of  the  matrix  multiplication  algorithm  characterized  by  H^  =  (2,1,3)  and  S  =  (1, 1,-1).  We  see 
that: 

there  is  no  shifting  register  for      A  because      |  g/q  J  q\,   |  -1  =  0, 

there  is  one  :  B  because     |  5(j'o"oy   I  -  1  =  l^ 

there  are  two     shifting  registers  for     C  because     |  g/ooi'y   I  —  1  =  2. 
We  now  discuss  how  to  utilize  PEs  in  which  the  values  of  the  parameters  fej's  and  <,'s  are  not 
under  our  control,  and  we  have  to  design  an  array  within  these  parameters.   From  Lemma  3,  the 
equation 

1^  =  ^(6.  +  !)  (2) 

must  be  satisfied  for  each  i. 

Let  link  1  correspond  to  A,  link  2  correspond  to  B,  and  link  3  correspond  to  C.  Assume  now 
that  we  are  provided  with  a  PE  in  which  6i  =  0,  62  =  1.  63  =  1,  and  /j  =  1,  <2  =  1,  h  =  —I.  From 
(2),  the  following  must  be  satisfied: 

H^(O.l.O)'  ^  ^    H^(1.0,0)'  ^  2    ^^^  H^(0,0.1)'  ^  _2 
S2(0,l,0)'  '  S2(l,0,0)'         '  ^"     S2(0,0,l)' 

Let  H2  =  (2^,1.2r)  and  S2  =  (<^,1,— r).  From  Condition  1,  6,t  >  0,  from  Condition  2,  2Sx  +  y  + 
2tz  =  0  and  bx-\-  y  —  rz  —  Ocan  not  both  be  true  at  the  same  time,  where  — (n  —  1)  <  x,y,z  <  n—1. 
For  n  =  4.  we  have  —3  <  x,y.z  <  3.  Finally,  from  Condition  .5,  we  obtain  the  following  three 
inequalities:  1.  2(^j  + 2/  +  2r::  7^  (*>a- +  ?/ —  r^  (unless  (1,3/,^)' =  m(/i  for  some  m),  2.  26x  +  y  +  2TZ^ 
2tx  +  2y  -  2tz  (unless  {x,y,zY  =  mdi  for  some  m),  3.  2bx  -\-  y  -\-  2rz  ^  —2bx  -  2y  -\-  2tz  (unless 
{x,y,z)'  =  mdy,  for  some  vi).  We  can  let  b  =  \  and  r  =  2.  Then,  H2  =  (2, 1,  4)  and  83  =  (1,1,-2) 
satisfy  Theorem  10  and  the  resulting  mapping  is  shown  in  Fig-8. 

Similarly,  assume  that  we  are  provided  with  a  PE  in  which  61  =  0,  62  =  !•  ''3  =  0,  and  ii  =  1, 
^2  =  1,  <3  =  —  1.  From  (2),  the  following  must  be  satisfied: 

Hi(O.l.O)'  ^      Hi(1.0,0)'  ^  Hi(0,0,l)'^ 

83(0, 1,0)'  '83(1,0,0)'         '  ^'^     83(0,0,1)' 

Let  H3  =  (2(5,  l,r)  and  83  =  (<*>,  1,-r).  From  Condition  1,6,t  >  0,  from  Condition  2,  26x+y-irTz  — 
0  and  bx  ■\-y  —  Tz  =  Q  can  not  both  be  true  at  the  same  time,  where  —(n—1)  <  x,y,z  <  n  —  1.  For 
n  =  4,  we  have  —S<x,y,z<  3.  Finally,  from  Condition  5,  we  obtain  the  following  three  inequali- 
ties: 1.  2bx-{-y+Tz  -^  bx+y—Tz  (unless  (x,  y,  2)'  =  mdi  for  some  m),  2.  26x+y+Tz  ^  2bx+2y—2TZ 
(unless  (x,y,  z)*  —  md2  for  some  m ),  3.  2bx  +  y  -\-  rz  :^  —Sx  —  y  +  tz  (unless  (x,  y,  2)'  =  md^  for 
some  m).  We  can  let  b  =  3  and  r  =  2.  Then,  H3  =  (6,1,2)  and  S3  =  (3,1,-2)  satisfy  Theorem 
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10  and  the  resulting  mapping  is  shown  in  Fig-9. 

As  the  reader  can  see,  the  above  three  examples  (the  original  one  and  the  two  new  ones)  share 
the  parameters  6i  =  0,  62  =  1,  and  ti  =  1,  ^2  =  1,  <3  =  -1-  Thus,  they  are  all  special  cases  of  a 
family  of  solutions  defined  by 

H^(0,1,0)'       .    H^(1,0,0)'      ^        ,  H^(0,0,1)' 

S4(0.1,0r   =  ''  S,(1.0,0)'   =  ''  ^"'  54(0.0.1)'   =  -('^  +  '^-  ^'^ 

Families  of  solutions  were  studied  before,  for  instance  by  Varman  and  Ramakrishnan  [40].  We  can 
show  that  the  family  studied  by  them  is  a  special  case  of  the  family  defined  by  (3).  Specifically, 
let  us  further  restrict  (3)  by  imposing  the  condition  that  the  only  free  parameter  63  +  2  in  (3)  is 
restricted  being  prime.  Using  our  notation,  we  can  say  that  they  found  that  if  63  +  2  is  prime  then 
mapping  the  index  (i,  j,A:)'  to  /'£'5xi+j-(6+i)xfc  ^t  time  2  x  (5  X  t  +  j  +  (<5  +  1)  X  (63  +  1)  X  A:  is  a 
family  of  solutions,  where  6  =  l^-^f^],  and  (5  =  ^  +  1  if  63 +  2  divides  6,  otherwise  6  =  6.  Their 
solutions  can  be  obtained  by  specifying  H-"-   =  {26, 1,  {6  +  1)(63  +  1))  and  S'  =  (6, 1, -{6  +  1)). 

However,  (3)  admits  other  solutions  too,  for  instance  the  two  solutions  above:  (H2,S2)  and 
(113,53).  Recall,  that  formula  (2)  is  even  more  general  than  formula  (3)  and  thus  we  can  find 
solutions  satisfying  formula  (2)  that  do  not  belong  to  the  family  defined  by  formula  (3). 

Our  method  is  also  sufficiently  general  to  encompass  some  additional  results  obtained  by  other 
researchers  previously.  We  now  list  them  for  completeness: 

1.  If  we  let  Ha  =  (2,1,7?)  and  5a  =  (1,1,0),  then  we  obtain  the  algorithms  in  Kulkarni  and 
Yen  [13]  and  Kung  [17]. 

Note:  in  this  case,  the  tokens  of  C  are  fixed  in  the  PEs. 

2.  If  we  let  H^  =  (2. 1.  n  — 1)  and  S^  =  (1,1,-1),  then  we  obtain  the  algorithm  in  Ramakrishnan 
et  al.  [31]. 

3.  If  we  let  HJ  =  (26.1. t)  and  5c  =  (6,1. -t)  for  6  =  n  and  r  =  ^^  when  n  is  odd.  and  for 
6  —  n  -  1  and  7"  =  f  when  n  is  even,  then  we  obtain  the  algorithms  in  Ramakrishnan  and 
Varman  [32]. 

4.  (H^  ,S')  is  the  "Full-Pipeline"  algorithm  in  Varman  and  Ramakrishnan  [40]. 

5.  If  we  let  H^  =  (26,l,v)  and  5^  =  (^,1,0)  for  prime  v.  and  n  <  1/  <  2n,  and  there  are  \j] 
local  registers  for  C  in  each  PE,  then  we  obtain  the  "Partial-Pipeline"  algorithm  in  Varman 
and  Ramakrishnan  [40]. 

Note:  in  this  case,  the  tokens  of  C  are  fixed  in  the  PEs. 
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The  other  objectives  in  designing  the  array  may  include  the  minimization  of  the  execution  time 
or  the  number  of  PEs  used.  To  state  formally: 

1.  If  we  want  to  obtain  an  algorithm  with  minimum  execution  time,  then  we  add  the  constraint 

H^  =  min{max{|H^(72  -  /i)|  I  luh  €  /"}}• 
Hi 

2.  If  we  want  to  obtain  an  algorithm  with  minimum  number  of  PEs,  then  we  axid  the  constraint 

S  =  min{max{|S(/2  -  /i)|  |  /i,/2  €  /''}}• 
S 

Note  that,  of  course,  it  may  not  in  general  be  possible  to  satisfy  these  two  objectives  simultaneously. 
We  will  not  consider  in  this  paper  other  optimization  criteria  such  as  the  minimization  of 
PE/time  product  or  PE/time^  product  or  the  maximization  of  throughput  or  PEs'  utilization. 
These  types  of  optimization  generally  require  smaller  number  of  PEs,  more  local  registers  in  a  PE, 
and  more  I/O  ports.  It  is  possible  to  create  designs  taking  into  account  these  criteria;  however, 
they  will  not  be  modularly  extensible,  storage  optimal,  or  bounded  I/O.  In  addition,  if  the  number 
of  local  registers  for  each  data  link  i  is  greater  than  g'^  '  -  1,  this  requires  memory  addressing  and 
control  hardware,  because  of  not  purely  systolic  nature  of  the  computation. 

6      The  partition  model. 

Consider  the  case  when  we  are  given  a  specific  type  of  PE  suitable  for  designing  the  required 
linear  array  algorithm.  As  the  size  of  the  problem  to  be  solved  grows  (e.g.,  the  dimensions  of  the 
matrices  to  be  multiplied  increase),  we  wiU  need  more  and  more  PEs  to  construct  an  array  solving 
the  problem,  as  the  amount  of  storage  needed  increases.  What  should  we  do  then  if  the  number  of 
PEs  provided  to  us  is  limited?  This  problem  was  studied  extensively  too,  see  [7]  [9]  [28]  [2],  who 
provided  partition  methods  to  deal  with  it. 

The  goal  of  Guibas  et  al.  [7]  and  Hwang  et  al.  [9]  is  to  solve  the  problem  by  using  2-D  ar- 
rays with  limited  number  of  PEs.  They  concentrated  on  specific  algorithms.  Guibas,  Kung  and 
Thompson  [7]  studied  partitioning  for  dynamic  programming,  matrix  multiplication,  and  transi- 
tive closure.  Hwang  and  Cheng  [9]  proposed  a  set  of  four  primitive  chips  which  when  properly 
interconnected  could  solve  the  matrix  multiplication,  LU  decomposition,  inversion  of  nonsingular 
triangular  matrices,  and  the  solution  of  linear  systems  of  equation.  Moldovan  and  Fortes  [28]  con- 
sidered a  general  model  for  partitioning.  They  divided  the  index  space  /^  into  bands  according  to 
the  time  hyperplane,  assigned  an  ordering  to  execute  the  bands,  and  then  mapped  each  band  to  a 
fixed  (p  —  1)-D  array. 
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Annaratone  ei  al.  [2]  gave  three  partitioning  methods:  input  partitioning,  output  partitioning, 
and  pipelining.  These  three  methods  were  applied  to  the  WARP  computer.  Input  partitioning  and 
output  partitioning  are  not  systolic.  Pipelining  is  systolic;  however,  it  is  a  partial  pipeline  (some 
data  are  fixed  in  PEs  and  some  data  are  pipelined).  Partial  pipelining  algorithms  were  partitioned 
by  assigning  each  row  or  column  of  a  matrix  to  a  PE,  and  each  PE  then  performs  one  stage  of  the 
processing.  The  intermediate  result  of  one  PE  will  be  sent  to  the  next  PE  resulting  in  pipeline. 
Since  WARP  computer  has  only  a  1-D  WARP  array  with  ten  PEs,  the  algorithm  requires  many 
scans  for  a  512  x  512  matrix  multiplication  problem. 

We  will  also  study  partitioning  of  algorithms.  Suppose,  P  =  {(2i,j2, . . .  ,ip)'  \lj  <  ij  < 
Uj  and  Uj  —  Ij  =  0(n)  for  j  =  l,2,...,p}.  For  each  value  of  n.  we  require  a  linear  array  of 
some  storage  size.  If  we  are  given  an  array  with  smaller  storage,  we  need  to  partition  the  algo- 
rithm. We  will  show  how  to  do  that  if  all  the  data  streams  flow  in  the  same  direction.  In  the  sequel 
we  will  refer  to  the  algorithms  considered  so  far  as  unpartitioned. 

Suppose  then  that  for  problem  size  characterized  by  n  as  above,  we  need  an  A'^  =  N{n)  storage 
linear  array  and  requiring  T  time  for  an  unpartitioned  algorithm.  If  we  are  given  only  a  fc-storage 
linear  array  {ot  k  <  N,  then  we  wiU  consider  feeding  the  data  streams  iN/k]  times  into  this  k- 
storage  linear  array.  This  approach  is  illustrated  in  Fig-10.  In  Fig  10-a,  if  the  data  streams  need  to 
be  fed  into  the  A'^-storage  linear  array  once  only,  then  in  Fig  10-b,  the  data  streams  will  be  fed  into 
the  fc-storage  linear  array  m  —  [A7^"l  times.  Thus,  the  new  algorithm  can  be  naturally  divided 
into  m  phases. 

Since  T  time  is  required  to  feed  the  data  streams  into  the  linear  array  once,  the  total  execution 
time  is  0{TN/k).  Next,  suppose  that  this  Ar-storage  linear  array  contains  q  PEs  and  some  index  / 
is  executed  in  PE^j  in  the  unpartitioned  algorithm.  Then,  I  will  be  executed  in 
■f-^((S/-min{S/i|/i6/p}+i)mod,)  '"  ^he  partitioned  algorithm  (a  modi  G  {1,2, .. .  ,6}).  Furthermore, 
it  is  executed  at  time  H^7  in  phase  [(S7  -  min{S/i|/i  6  P]  +  I)/?].  We  state  the  above  as  the 
following  condition: 

Condition  6:  If  all  the  data  streams  have  the  same  direction,  i.e.  for  all  d,,  Sd,  >  0,  and  q  is  the 
number  of  PEs  in  the  available  linear  array,  then  for  the  partition  algorithm  (Hq,Sq).- 

•  H^7  =  Hi/in  phase  [(S/ -  min{S/i|/i  £  I^]  +  l)/9l,and 

•  Sq/  =  (S7  -  min{S/i|/i  e  /P)  +  1)  mod  q. 

Note:  This  condition  applies  if  we  do  not  utilize  any  information  about  special  properties  of 
the  specific  problem  and  its  unpartitioned  solution  that  could  hold  for  some  special  cases.  Thus, 
it  is  applicable  to  general  p  nested  loop  algorithms.  We  are  studying  some  important  problems  for 
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which  advantage  can  be  taken  of  the  problem's  structure.  They  are,  however,  beyond  the  scope  of 
this  paper. 

As  an  example,  we  will  now  present  a  family  of  unpartitioned  matrix-multiplication  algorithms, 
which  can  be  easily  partitioned. 

As  before,  let  link  1  correspond  to  A,  link  2  correspond  to  B,  and  link  3  correspond  to  C. 
Suppose  that  we  are  restricted  to  PEs  characterized  by  6i  =  0,  6i  =  1,  and  t^  =  t^  =  tj  =  I. 
We  want  to  derive  conditions  on  63.  From  Condition  6,  Sd,  >  0,  from  Condition  1,  H-^cf;  >  0, 
and  from  Conditions  3  and  4,  ^/'  =63+1.  Therefore,  we  must  choose  H^  =  (2(5,l,r(63  -\-  1)) 
and  S  =  (M,r),  where  (5  >  0  and  r  >  0.  Next,  from  Condition  5',  ^  7^  f  and  \  ^  liktii  and 
lih±ll  ^  ^^  implying  (63  +  1)  >  2.  or  63  >  2.  Then,  from  Condition  2,  2Sx  +  y  +  7(63  +  l)r  =  0 
and  62-  +  y  +  t:  =  0  can  not  both  be  true  at  the  same  time  (where  —{n  -  1)  <  x.y,z  <  n  —  I). 
Finally,  from  Condition  5,  we  obtain  the  following  three  inequalities: 

—  2/ +  r(63  -  1)^  7^  0     (unless  (x^y.zY  =  mdi  ior  some  m)  (4) 

6x  +  rb^z  :/^  0      (unless   (x,2/,2)'  =  md2  for  some  m)  (5) 

^(63  -  l)j  +  631/ 7^  0     (unless   (x,  t/,^)*  =  777^3  for  some  m)  (6) 

To  find  simple  solutions  for  6  and  r  so  that  (4)-(6)  are  satisfied,  we  can  require  ^ 

r(63-l)>77gcd(r(63-l),l)        ■  (7) 

Tb3>ngcd{Tb3J)  (8) 

6(b3-l)>ngcd(S(b3-l).b3)  .  (9) 

In  particular,  if  63  is  prime,  then  first  let  6  =  \^^'].  Second,  if  63  divides  b,  then  let  (5  =  ^  -f  1 
and  T  =  h:  otherwise  let  ^  =  ^  and  t  -  b  -\-  I. 

Such  a  choice  of  b  and  r  will  satisfy  (7),  (8),  and  (9),  because  gcd(r63,^)  =  gcd(r(63  -  1),  1)  = 
gcd((^(63  -  1),63)  =  1  and  rbj^rib^,  -  1),  8(bz  -  1)  >  n.  Therefore,  based  on  this  choice,  (H^,S)  is 
a  family  of  algorithms  for  which  all  data  streams  flow  in  the  same  direction.  In  addition,  both  the 
time  complexity  and  the  total  size  of  the  storage  including  the  number  of  PEs  and  the  number  of 
the  local  registers  in  each  PE  are  0[n'^).  To  create  the  partitioned  version  of  these  algorithms  is 
straightforward  and  we  ommit  the  details  from  this  presentation. 


^Varman  and  Ramaknshnan  [40]  showed  that  if  i,  y,  n.  a  and  8  are  integers  such  that  n  >  0,  0  <  |x|,|j/|  <  n, 
\<.a<n,  0>n  gcd(a,  /?),  then  qx  ^  0y. 
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7     Two  Path-Finding  Algorithms 

Path-finding  problems  '1]  include  the  problems  to  compute  the  transitive  closure  and  all  shortest 
paths.  Kung  [19]  and  Rote  [34]  also  show  that  nonsingular  matrix-inversion  problem  is  a  special 
case  of  the  path-finding  problem.  For  2-D  systolic  arrays  to  solve  path-finding  problems,  consult 
[7]  [34]  [23]  [19]. 

Guibas,  Kung  and  Thompson  [7]  first  proposed  a  three-pass  mesh-connected  systolic-array 
algorithm  for  the  path-finding  problem.  This  algorithm  is  very  important  for  pragmatical  reasons, 
because,  the  data  streams  of  each  pass  are  the  same  as  that  of  the  matrix  multiplication  algorithm 
described  in  Section  3.  Therefore,  we  can  construct  a  three-pass  linear-array  algorithm  for  the 
path-finding  problem  based  on  their  algorithm,  which  both  is  modularly  extensible  and  provides  a 
partitioned  algorithm.  Varman  and  Ramakrishnan  [37]  also  gave  a  solution  for  this  method. 

However,  in  this  section,  we  want  to  synthesize  a  linear-array  algorithm  directly  from  the 
Warshall- Floyd  path-finding  algorithm,  because  it  is  of  independent  interest  for  the  following  rea- 
sons: (1)  this  algorithm  has  five  data-dependence  vectors  instead  of  three  data-dependence  vectors"^ 
in  [7],  (2)  this  linear-array  algorithm  needs  only  one  pass  of  data  streams  in  unpartitioned  algorithm 
instead  of  three  passes  in  the  algorithm  based  on  [7], 

A.  The  Warshall-Floyd  path-finding  algorithm. 

The  transitive  closure  problem:  Given  an  n  x  n  Boolean  matrix  of  elements  C[?,  j]  over  an  n 
node  graph,  with  C[i,j]  being  1  if  there  is  an  arc  from  node  i  to  node  j  or  i  =  j,  and  0,  otherwise. 
Then,  determine  whether  there  is  a  path  from  node  i  to  node  j  for  all  z,j. 

The  all-shortest-path  problem:  Given  an  n  X  n  distance  matrix  of  elements  D[i,j]  over  an  n 
node  graph,  with  D[i,j]  being  the  length  of  the  arc  from  node  i  to  node  j,  where  D[i,j]  >  0  and 
D[i,i]  —  0  for  all  i,j.  Then,  determine  what  is  the  shortest  distance  from  node  i  to  node  j  for  all 

i,  j- 

The  transitive  closure  and  the  all  shortest  paths  can  be  obtained  by  the  Warshall-Floyd  algo- 
rithm, where  both  input  and  solution  will  be  in  C: 

for  k,  i,j  =  1  to  7? 

c[ij]  =  c[i,j]e{C[i,k]^c[k,j]) 

end  _for 

The  above  notation  is  a  convenient  representation  for  the  3  nested  loop  algorithm.  ©  represents 
the  logical  V  (min)  and  (g)  represents  the  logical  A  {  +  )  in  the  transitive-closure  problem  (all- 
shortest-path  problem),  respectively. 

After  adding  indices  to  each  variable,  the  algorithm  becomes: 

Because  there  are  three  data-dependence  vectors  in  matrix-multiplication  algorithm. 
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for  k,  i,j  =  1  to  ri 
if;  =  1 

thenC(>'-'-^^[i,k]  =  C^^'-'''j">[i\j'] 
else  ^('■■••■•''[i.  ^-]  =  C('''''-'-^'[?,  /I-]; 
if?  =  l 

else  CC^'-'J'!^-,;]  =  C('^''-i'J)[fc,;]; 

end_for 

There  are  five  data-dependence  vectors: 

di  =        (0,0,1)'  for  iC^'''''^^[i,klC^'''''^-^^[uk]) 

^2=        (0,1,0)'  for  {C^''''^^[kJ],C^''''-''^^[k,j]) 

d3=        (1.0.0)'  for  (C<*'''J)[i,j].C('^-i-''j)[?,;]) 

d4=  (l.O.j-kY  for  (C''='''j)[i,it],C('^'''''j')[i',j']) 

(/s  =  (l.?-/t,0)'  for  (C('^'''j)[il%i].C('''''''j''[j',i']) 

We  have  0^4,  because  C[j:,^-]  =  C[i'J'],  i  =  i',  k  =  /,  and  k'  =  /r  -  1,  thus  ik,i,jY  -  {k'.i'J'Y  = 
(1,0,  j  -  k^.  Similarly  we  have  dr,,  because  C[k,j]  -  C[i',j'],  k  -  i',  j  =  j',  and  k'  =  k  -  1, 
thus  {k,i,jY  -  {k'.i'.j'Y  =  (1,«  -  ^-.0)'.  Although  there  is  only  one  input  stream  of  C,  say  C[i,j] 
corresponding  to  ^3,  during  the  execution  it  will  create  four  data  streams  corresponding  to  rfj,  ^2, 
^4,  and  ds,  respectively.  The  data-dependence  graph  for  n  =  3  is  shown  in  Fig-11.  In  this  graph, 
it  is  easy  to  see  that  (^4  and  ^5  are  not  constant.  Since  we  cannot  accommodate  non-constant 
speed  in  a  data  stream,  we  have  to  fix  the  corresponding  two  data  streams  in  PE's.  If  we  let 
S5C?4  =  S^d^  -  0.  it  imphes  S5  =  (0,0,0),  that  is,  all  the  indices  will  be  mapped  into  PEq. 
Therefore,  we  can  not  pipeline  this  algorithm  in  the  linear  array.  Thus  the  algorithm  must  be 
modified. 

B.  A  reindexed  path-finding  algorithm. 

In  transitive-closure  problem,  we  have  the  following  properties:  First,  there  is  always  a  path  from 
node  i  to  node  i  (i.e.  initially.  C[i.i]  -  1),  and  second,  C''[i,j]  >  C'[i,j]  if  k  >  /,  where  CliJ]  =  1 
means  that  there  is  a  path  from  node  i  to  node  j  passing  through  intermediate  nodes  from  {1,2, 
...,  p]  only,  and  C^[i.j]  =  0,  otherwise.  (Similarly  in  all-shortest-path  problem:  First  D[i,i]  =  0 
because  the  shortest  distance  from  node  i  to  node  i  is  zero,  and  second,  Z)'^[i,  j]  <  D'[i,j]  if  k  >  /, 
where  D'^[i,j]  denotes  the  shortest  distance  from  node  i  to  node  j  passing  through  intermediate 
nodes  in  {1,  2,  . . . ,  9}  only.) 
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Now  let  us  consider  the  following  reindexed  transitive-closure  algorithm,  which  Kung  et  al.  [19] 
have  mapped  to  a  2-D  spiral  array  and  a  2-D  orthogonal  array: 

for  /c,  i,_7  =  1  to  n 

C[i  -\-  k  -  I  mod  n.j  +  k-  1  mod  n]  = 

C[i  +  ^-  -  1  mod  n,j+k-l  mod  n]  © 

iC[i  +  k  -  1  mod  n,k]l^C[kJ  +  k  -  1  mod  n]) 
endJbr       /*  Note:  /  mod  n  G  {1,2,  ...,n}  */ 

Based  on  the  properties  C[i,i]  =  1  and  C''[i,j]  >  C'[i,j]  for  k  >  /,  we  can  pipeline  each  variable 
and  add  the  index  to  each  variable  to  obtain: 

for  k,i,j  =  1  to  n 
if  f  =  n 

if  J-'  =  n 

then  C<''"'"'"'[7i  +  k  -  \  mod  n.7i  +  ^'  -  1  mod  n]  =  1 
else  C*''''"'-''[n  +  ^-  -  1  mod  nj+k-l  mod  n]  = 

C^'^'-^'j')[k'J'  +  k'-  1  mod«]; 
elsejf  j  =  n 

then  C''^'''")[?  +  A:  -  1  mod  n.n  +  k-  1  mod  n]  = 

C('''''''"'[«'  +  ^-'-lmodn,A:'] 
else  (?(*''•'"''[!  +  A-  -  1  mod  n,  j  +  Ar  -  1  mod  n]  = 

C(A-'.:'./)[,'  +  ;t'  _  1  mod  nj'  +  k'  -1  mod  n]; 

if  j  -  1 

then  C(''"-''^)[i  +  k-  I  mod  n,A-]  = 

C('=-''i)[i  +  A-  -  1  mod  n,  1  +  A  -  1  mod  n] 
else  C(''''-''['  +  A-  -  1  mod  n,A-]  = 

C<''--«'J-i)[)  + A-  1  mod  n,k]\ 
if  z  =  1 

then  C('^'i'J)[A-,  j  +  A  -  1  mod  n]  = 

C('''i'J)[l  -f  A-  -  1  mod  n,  j  +  A  -  1  mod  n] 
else  C(''"'''J)[^',  J  +  A  -  1  mod  n]  = 

C(t,.-i,j)[;t,j-fA-  1  modn]: 
C(>='''J)[i  +  A-  -  1  mod  n.  j  +  A-  -  1  mod  n]  = 

(7('.-,'.j)[f  +  ^.  _  1  mod  n,i  +  A  -  1  mod  n]  © 
C('=.'.J)[y  +  A  -  1  mod  n,A-]®  C'*'''J)[^%  J  +  't  -  1  mod  n] 
endJbr       /*  Note:  ?' mod  n  G  {1,2,  ...,7?}  */ 
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After  labelling,  there  are  also  five  data-dependence  vectors:  cfj  =  (0,0. 1)'  for  (C^'''''^^[i+  k-  I  mod 
n,fc],C('='''J-i)[t  +  fc  -  1  mod  n,fc]),  dj  =  (0,1,0)'  for  {C^''''''^[kJ  +  k-\  mod  n],C^''''-'^-^^[kJ  + 
/t  -  1  mod  n]),  da  =  (1,-1,-1)'  for  (C'''"''-')[(  +  A:  -  1  mod  nj  +  k  -  1  mod  n],  C''-'''''-'')[«'  + 
k'  -  1  mod  nj'  +  k'  -  1  mod  n]),  ^4  =  (1,-1,0)'  for  (C<'^'''"'[i  +  Z^'  -  1  mod  Ti,n  -\-  k  -  1  mod  n], 
CC^''-'.")!?'  +  fc'  -  1  mod  n,/L-'])  and  ^5  =  (1,0,-1)'  for  (C^'^-^-^'in  +  k  -  I  mod  n,;  +  fc  -  1  mod 
n],C^^''^'^">[k',j'  +  k'-l  mod  n]). 

We  have  cfa,  because  C[j  + fc—  1  mod  n,j  +  k-l  mod  «]  =  C[i'  +  k'  -  1  mod  n,j'  +  k'—  1  mod  n], 
j  +  A'  -  1  mod  7}  =  i'  +  k'  —  1  mod  n,  j  +  k  -  \  mod  n  =  j'  +  A;'  -  1  mod  n,  and  A:'  =  A;  —  1,  thus 
{k,i,jy  —  {k',i',j'y  =  (1,-1.-1)'.  Similarly  we  have  d^,  because  C[i  +  A-  —  1  mod  n,n  +  k  — 
1  mod  n]  =  C[i'  +  A-'  —  1  mod  n.  A:'],  /  -|-  A;  —  1  mod  n  —  i'  -{■  k'  —  \  mod  n,  n  -f  A;  —  1  mod  n  =  A;',  and 
A''  =  A--1,  thus  {k,i,jy-{k'J'J'y  =  (1,-1,0)'.  Finally,  wehav^ds,  because  C[n  + A'- 1  mod  n,j+ 
k  —  1  mod  J}]  =  C[k',  j'  +  k'  —  1  mod  n],  n  +  k  —  1  mod  n  —  k',  j  -\-  k  —  1  mod  n  —  j'  +  k'  —  1  mod  n, 
and  A-'  =  A-  -  1.  thus  (k^i.jY  -  (A-',  ?',/)'  =  (1,0,-1)'.  The  data-dependence  graph  for  n  =  3 
is  shown  in  Fig- 12.  Unlike  in  the  WarshaU-Floyd  algorithm,  all  the  data-dependence  vectors  are 
constant. 

Following  our  method,  we  obtain  Hg  =  (3.1,1),  and  if  we  let  Sq  =  (0.1.1).  Hg  =  (a,6.c), 
then  from  Condition  1,  we  have  a,b,c  >  0,  and  then  from  Condition  3',  we  have  f  =  cj,  j-  =  C2, 
"'^2^  =  C3,  ^f^  =  C4  and  ^f^  =  C5,  where  all  of  the  Ci  to  C5  are  integers,  after  that,  from  Condition 
5',  we  have  c,  ^  Cj  for  all  i  /  j  and  1  <  ?',  j  <  5.  Then,  modifying  Hg  =  (3, 1, 1),  we  let  6  =  2  and 

c  =  1.  After  that,  from  Conditions  2  and  5.  ^^[l'l'%  /  1,  2,  ^.  ^,  ^,  for  (x,2/,-)*  ^  h  -  h 
and  I2  —  I\  ^  md,.  Then  we  obtain  the  following  five  inequalities:  ^ 

ax -\- y  ^  0      (unless  (x,  J/,  5)'  =  m(fi  for  some  m)  (10) 

ax  -  c  /  0      (unless  (a-,?/,  r)' =  r77(/2  for  some  m)  (11) 

a{2.T  +  y  +  :)  ^  (z  -  y)      (unless  (x,  ?/,  2)'  =  md^  for  some  m)  (12) 

-  a{x  +  y)  ^  (a  -  l)z      (unless  {x,y,zY  =  rnd^  for  some  m)  (13) 

-  a(x  +  z)  ^  {a  +  l)y     (unless  (i,?/,  2)' =  mds  for  some  m)  (14) 

Finally,  if  a  >  n,  o  >  2(n  -  1)  and  a  is  odd,  we  can  get  (a, 2, 1)  as  a  solution  for  Hg. 

As  an  example,  for  n  =  3.  let  a  =  5.  Then  the  data-flow  diagram  of  the  hnear-array  algorithm 
Hg  =  (5,2, 1)  and  Sg  =  (0, 1, 1)  is  shown  in  Fig- 13.  This  mapping  maps  (k,i,jY  to  step  (5A:-|-2?-l-j) 
and  executes  it  in  PE,+j.  The  foDowing  describes  how  the  linear  array  works: 

1.  Each  PE  has  five  data  links,  data  links  1  and  2  flow  from  left  to  right,  data  links  3,  4,  and 
5  flow  from  right  to  left.   In  addition,  there  are  no  shifting  registers  for  data  links  1  and  3, 
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one  shifting  register  for  data  link  2,  two  shifting  registers  for  data  link  4,  and  three  shifting 
registers  for  data  link  5. 

2.  Only  one  input  stream  C  is  fed  into  the  array.  It  satisfies  Corollary  5  for  the  entrance  time 
and  is  fed  into  data  link  3  from  right  to  left. 

3.  In  each  index  execution  we  need  three  data  tokens:  23  {C[i  +  A;  -  1  mod  n,j  +  k  -  1  mod  n]), 
zi  (C[i  +  k  -  I  mod  n,k]).  z-2  {C[k,j  +  k  -  I  mod  n])  and  generate  a  new  result  r  (C[i  +  k  - 
1  mod  n,j  +  k—l  mod  n]).  We  will  use  x  <—  /,  to  represent  that  the  PE  gets  x  from  the  data 
link  i  and  will  use  y  -*  Ij  to  represent  that  the  PE  puts  y  onto  the  data  link  j.  Then,  when 
k  —  1,  we  have  the  following  functions  in  each  cell  when  performing  the  index  (i,j,  fc)': 


3  = 

1 

1  <J 

<  n 

J  = 

n 

2l    ^   ^3 

r-li 

zi  -  h 

h^h 

Zl^h 

h^h 

i  =  1 

-2   —  -3 

r-h 

~2   *—  ~3 

r^h 

22  ^  23 

r^h 

Z2  -  /3 

r~l3 

^3   ~  ^3 

r^h 

Z3  ^  h 

r-^h 

Z\    ^  -3 

r^h 

^^1    -  /l 

h  -h 

z^  -  /i 

h-U 

1  <  i  <  n 

22   ^  h 

h-h 

-2  *—  h 

h^h 

22  —  h 

h^h 

^3  ^  ^3 

r-h 

Z3  ~  h 

r-h 

Z3  —  h 

r^h 

^1    —  ~3 

r~h 

z,  -  h 

'1-/1 

zi  -  h 

h^U 

I  —  n 

-2  —  h 

h  -  h 

22   ^  h 

'2-/5 

22  —  h 

h-*k 

Z3  -  I3 

r-h 

Z3  —  '3 

r-/3 

Z3  ^  h 

r-h 

When  2  <  A;  <  n,  we  have  the  following  functions  in  each  cell  when  performing  the  index 


J  = 

1 

1  <J 

<  n 

3  = 

n 

^1  ^  23 

r-h 

21  -  /i 

h  -h 

21  -  h 

h-h 

i  =  1 

-2    —  ^3 

r~h 

22  —  23 

r-h 

22  —  23 

r-h 

Z3  -  h 

r-h 

23  -  /3 

r^h 

23  -  u 

r^h 

z\  —  23 

r-h 

21    -h 

h-h 

21    -h 

h-h 

1  <  t  <  rz 

22   -  h 

h-h 

22   —  h 

h-h 

22   —  h 

h-h 

Z3  -  h 

r-h 

23  -  /3 

r-h 

23  -  h 

r-h 

z\  ^  23 

r-h 

21    -  h 

h  -h 

21    -  h 

h-h 

I  =  n 

22   ^  h 

h-h 

22   ^  h 

h-h 

22   -  h 

h-h 

23  ^  h 

r-^h 

23  ^  h 

r-h 

Z3-  1 

r-h 

4.  The  output  results  also  appear  in  data  link  3. 
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The  time  complexity  of  this  algorithm  is  0(n'^),  the  number  of  the  PEs  used  is  0{n).  In  each 
PE,  we  need  0(n)  shifting  registers  for  data  links  3,  4  and  5,  requiring  the  total  storage  of  0(7i^). 
Since,  in  this  linear-array  algorithm  the  number  of  the  local  registers  in  each  PE  is  a  function  of 
the  problem  size  n,  this  algorithm  is  not  modularly  extensible. 

We  have  also  found  that  Hy  =  (a',  3, 1)  and  S7  =  (3, 1, 1)  can  be  easily  partitioned,  where  a' 
>  24(n  —  1),  a'  >  2n  +  9,  and  a'  is  odd.  However,  (H^iSy)  is  too  not  modularly  extensible.  We 
ommit  the  details  from  this  presentation. 

8      Conclusions 

A  systematic  method  for  synthesizing  linear-array  implementations  from  nested  loop  algorithms 
has  been  presented  in  this  paper.  The  method  was  based  on  a  set  of  formal  necessary  and  sufficient 
conditions  on  the  sequential  algorithm  and  feasible  transformations  that  were  listed  in  the  paper. 
This  method  can  be  used  down  to  register  level. 

An  important  contribution  of  this  paper  is  the  classification  of  data-dependence  vectors  based 
on  the  "Zero-One-Infinite"  property.  Having  this  classification,  we  can  understand  the  characters 
of  the  tokens  in  data  streams.  It  also  allows  us  to  formulate  conditions  on  the  target  linear  array. 
That  is,  we  can  determine  whether  tokens  can  be  destroyed  or  not,  whether  tokens  can  be  pipelined 
or  not,  and  whether  a  PE  needs  additional  I/O  ports  or  not. 

We  also  find  tight  bounds  on  time  complexity  of  linear  array  implementations  for  an  impor- 
tant class  of  algorithms.  Specifically,  time  complexity  =  ©(storage  complexity).  Such  algorithms 
include  matrix  multiplication,  L-U  decomposition,  inversion  of  nonsingular  triangular  matrix,  ma- 
trix orthogonal  triangularization,  a  version  of  transitive  closure  algorithm  [7],  a  version  of  Discrete 
Fourier  Transform  (DFT)  [24],  bubble  sort,  and  others.  We  also  provide  a  technique  to  determine 
whether  linear  array  implementations  of  these  algorithms  have  both  the  optimal  time  complexity 
and  optimal  storage  complexity. 

We  studied  a  partition  model  for  designing  algorithms  that  can  be  partitioned.  Our  method 
for  partitioning  is  quite  general,  although  we  need  to  impose  the  condition  that  aU  data  streams 
flow  in  the  same  direction. 

Synthesis  of  families  of  linear  array  implementation  for  matrix  multiplication  algorithms  was 
used  to  illustrate  our  method.  We  were  thus  able  to  show  the  interplay  between  the  formal 
conditions  listed  and  various  properties  of  the  linear  array  implementations.  In  addition  we  used 
the  method  to  synthesize  a  linear  array  algorithm  for  a  variant  of  the  Warshall-Floyd  algorithm 
[19]. 

As  in  this  paper  we  restricted  ourselves  to  designing  linear  arrays  with  constant  I/O,  we  did 
not  consider  the  case  where  Srf,  =  0.  Our  methodology  can,  however,  be  extended  to  handle  this 
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case  too.  After  doing  that,  we  can  synthesize  all  1-D  time  hyperplane  and  one-wavefront  1-D 
space  hyperplane  mappings  to  systolic  implementations.  That  is,  if  the  data  streams  of  systolic 
linear  array  implementations  "do  not  change"  directions  during  the  execution,  then  we  also  can 
synthesize  those  implementations. 

Our  method  can  be  used  in  two  ways.  It  can  transform  a  p  nested  loop  algorithm  onto  a  new 
linear  array  implementation.  It  also  can  synthesize  a  linear  array  implementation  by  listing  basic 
constraints  on  a  PE  first.  Thus  our  method  contributes  towards  the  utilization  of  programmable 
linear  arrays.  That  is,  in  the  model  of  general  Sd,,  we  can  design  only  one  type  of  a  PE  with 
appropriate  number  of  data  links  and  local  registers,  which  is  subsequently  used  to  synthesize 
a  single  linear  array  that  can  solve  matrix  multiplication,  L-U  decomposition,  triangular  linear 
system,  triangular  matrix  inversion,  DFT,  convolution,  longest  common  subsequence,  transitive 
closure,  matrix-vector  multiplication,  bubble  sort,  and  others. 

As  our  method  is  able  to  synthesize  large  families  of  feasible  linear  array  implementations,  a 
software  tool  analogous  to  that  of  Moldovan's  [29]  could  be  used  to  help  in  selecting  implementations 
optimizing  additional  criteria. 

Finally,  the  problem  of  transforming  the  algorithms  with  the  non-constant  data  dependencies 
to  linear  array  algorithms  is  still  not  satisfactorily  solved,  even  though  it  may  be  possible  to  handle 
them  on  an  ad  hoc  basis  as  we  did  for  the  Warshall-Floyd  algorithm. 

9      Appendix  :  Summary  of  the  Method 

In  this  appendix,  we  summarize  our  mapping  method  as  a  procedure  with  seven  steps.  In  what 
follows,  the  algorithm  model  is  Ag  =  (/'',  V'^p,  F^p).  We  use  symbol  d,  to  denote  the  i-th  data- 
dependence  vector;  7,  /i.  and  I2,  the  indices  in  the  index  set  P  —  {(21,12^  •  •  •  7  v)*l'i  -  *j  - 
Uj  for  j  =  l,2,...,p};  H^  =  H  -|-  IT,  where  H  is  Lamport's  hyperplane,  11  is  the  assistant 
hyperplane  related  to  H  .  and  H  is  the  hyperplane  for  1-D  linear  array;  S,  the  space  hyperplane. 
The  linear-array  model  is  Ar  =  (M ,  K  ,Tat,  B  at  ,  Fat)\  U-b,  denotes  the  direction  and  the  number 
of  the  shifting  registers  in  each  PE  of  the  data  link  i,  respectively. 

Procedure 

Input  :  A  p  nested  loop  algorithm  Ag  —  [P^VAg^FAg)- 

Output:  A  linear-array  algorithm  (H-^.S).  Where. 

H-^  is  a  1-D  time  hyperplane,  and  S  is  a  1-D  space  hyperplane. 
Input  or  Output:  A  linear  array  Ar  =   (M^K^TatiBatiFat)  can  be 

either  as  an  input  for  restricted  interconnection  primitives 

or  as  an  output  of  a  new  linear  array. 
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1.  Index  all  variables  in  the  algorithm,  that  is,  eliminate  broadcasting  of  data  and  add  the 
missing  index  to  each  variable. 

2.  Find  the  set  of  data-dependence  vectors.  Each  data-dependence  vector  d,  will  correspond  to 
data  link  i  of  the  linear  array. 

3.  Compute  H^  and  S. 

We  use  a  heuristic  method: 

First,  find  Lamport's  hyperplane  H^  =  (0-1,02,  ■■■  ,oip).  such  that  H  d,  >  0  for  all  data- 
dependence  vectors,  and  it  satisfies  H     =  minjj{max{|H(72  —  /i)|  |  I\.h  €  ^^}}- 

Second,  let  the  space  mapping  S  =  {si,S2,-  ■  ■  ,Sp),  where  Si,S2,  ■  •  • ,  and  Sp  are  variables  with 
values  to  be  determined  later. 

Third,  let  the  assistant  hyperplane  11  =  (tti,  7r2, . . .  ,7rp),  where  7ri,7r2,  ...,  and  TTp  are  also 
variables  with  values  to  be  determined  later. 

Fourth,  let  H^  =  H^  -|-  n  =  (qi  +  7ri,Q2  -f  7r2, . . .  ,Qp -|-  TTp). 

(Note:   W'e  found  that  it  is  easy  to  derive  H     by  modifying  Lamport's  H    ,  other  methods 

for  finding  H'^  independently  are  possible.) 

Fifth,  define  the  relations  between  H  and  S.  In  order  to  preserve  the  data-dependence 
relations  and  to  relate  them  to  the  linear-array  model,  H  and  S  must  satisfy  the  following 
constraints: 

(a)  H^d,  >  0  for  every  data-dependence  vector  rf,. 

(b)  If  7]  and  I2  are  two  indices  of  P,  then 

for  {Pj  -  qj)  <  Xj  <  (qj  -  Pj )  and  j  =  I p. 

(c)  if  d,  7^  mdj  and  dj  ^  md,  for  any  m,  then    g^  '   7^     g ^  ^ . 

(d)  If  (/2  -  A)  7^  md,  for  any  m,  then  W^(l2-  Ii)Sd,  /  S(/2  -  h)n'^d,. 

(e)  g^*^'  must  be  an  integer  for  every  data-dependence  vector  d,. 

(f)  If  there  are  6,  shifting  registers  in  each  PE  for  the  data  link  i,  then  6,  =  |  g^*^'  |  -  1. 

(g)  If  Sdj  >  0  then  t,  =  1  and  the  data  stream  i  will  be  fed  into  the  linear  array  at 
^•E^min{S/i|/ie/p}-  ^^  ^^'  ^  ^  ^^^^'^  ^'  ~  ~^  ^^^  ^^^^  data  stream  i  wiU  be  fed  into  the 
linear  array  at  ■P.Ejnax{S/2|/2e/p}-  (Note:  If  Sd,  —  0  then  i,  =  0  and  the  data  stream  will 
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be  fixed  in  the  PEs.  However,  in  this  paper,  we  chose  not  to  consider  this  case,  although 
it  is  a  simple  extension.) 

Note:    (a)-(d)  guarantee  no  data  conflicts  in  the  data  flow;  (e)-(g)  relate  the  linear-array 
algorithm  (H-^,S)  to  the  linear-array  model  Ar. 

4.  Determine  the  trade-ofi"s  between  the  time,  the  number  of  PEs,  and  the  number  of  the  local 
shifting  registers  in  each  PE. 

(a)  Given  a  space  mapping  S  =  {si,S2,.  ■  ■  iSp),  H^  must  satisfy  all  of  the  constraints  in 
step  3. 

(b)  Given  the  number  of  the  shifting  registers  of  the  PE,  from  constraints  (e)-(g)  in  step  3, 
H^  and  S  must  satisfy,  ^g^  =  ^(6,  -|-  1). 

(c)  If  we  want  to  obtain  an  algorithm  with  minimum  execution  time,  then  we  add  the 
constraint  H^  =  minjji{max{|Hl(/2  -  /i)|  |  h,  h  G  /''}}• 

(d)  If  we  want  to  obtain  an  algorithm  with  minimum  number  of  PEs,  then  we  add  the 
constraint  S  =  ming{max{|S(/2  -  /i)|  |  /1./2  G  I^}]- 

Note  that,  of  course,  it  may  not  in  general  be  possible  to  satisfy  the  two  objectives  (c)  and 
(d)  simultaneously. 

5.  Map  the  algorithm  into  the  linear  array: 

•  The  number  of  PEs  is  M  =  max{|S(/2  -  /i)|  |  luh  £  I"}  +  1- 

Note:  Although  the  PEs  are  numbered  here  from  min{S/i|/i  G  P}  to  max{S72|-^2  £  I^}^ 
we  can  renumber  them  from  1  to  M . 

•  The  execution  time  is  T  =  0(max{|Hl(/2  -  /i)|  |  I^h  G  P}  +  M(l  +  E^i  bi)). 

•  Define  the  entrance  time  of  each  data  token: 

Suppose  a  variable  x.  whose  data-dependence  vector  is  tf,  of  type  1,  is  used  in  I  and 

enters  PE^j  at  time  H^/. 

If  Sd,  >  0.  then  x  is  fed  into  the  linear  array  at  /'£inin{S/i|7i€/p}  ^^  time 

Hi/  -  (S7  -  min{S/il/i  G  Z"})^- 
If  Sd,  <  0,  then  x  is  fed  into  /'-E'niax{S/2|/2  6/p}  ^t  time 

H^/  -  (max{S/2|/2  G  F}  -  S/)^-^. 

hdi 
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•  We  can  execute  the  function  Fat  =  f^Ag  in  P^SI  ^^  time  H^I. 

6.  Consider  partitioning  of  the  algorithm,  when  the  size  of  the  array  required  for  the  unparti- 
tioned  algorithm  is  larger  than  that  of  the  available  linear  array: 

If  all  of  the  data  streams  have  the  same  direction,  i.e.  for  all  rf,,  Sd,  >  0,  and  q  is  the  number 
of  PEs  in  the  available  linear  array,  then  we  have  a  partitioned  algorithm  (Hi,  Sq)  as  follow: 

•  Hl/:^  H^/ in  phase  ["(S7- min{S/i|/i  € /P}  + l)/^].  and 

•  Sq/=  (S7-  min{S7i|/i  e  P}  +  1)  mod  9.  (a  mod  be  {1,2, . . .  ,6}.) 

7.  Analyze  the  performance. 
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Fig    7.  The  data-flow  diagram  of  H'-  =  (2,  1.  2)  and  S  =  (1, 1,-2) 
Where  C03  collides  to  Cjo  and  C13  collides  to  C30. 
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Fig    9    Tht  diUflo-di.jr.mofnJ  =(6,1.2).nd  S  =  (3.I.-2) 
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Fig.  10-a.  N  =  mk  and  data  streams  are  fed  into 
the  A^-storage  array  only  once. 
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Fig    11.  The  data-dependence  graph  for  the  Warshall-FIoyd  algorithm. 
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Fig.   12.  The  data-dependence  graph  for  the  reindexed  Warshall-Floyd  algorithm. 
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Fig    13    The  dala-Rowdiagrani  for  hJ  =  (5,2,1)  and  Sg  =  (0.1,1), 
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